Abstract: This paper investigates correct variable selection in finite samples
via $\ell_1$ and $\ell_1 + \ell_2$ type penalization schemes. The asymptotic
consistency of variable selection immediately follows from this analysis. We
focus on logistic and linear regression models. The following questions are
central to our paper: given a level of confidence $1 - \delta$, under which
assumptions on the design matrix, for which strength of the signal and for what
values of the tuning parameters can we identify the true model at the given
level of confidence? Formally, if $\widehat{I}$ is an estimate of the true
variable set $I^*$, we study conditions under which
$P(\widehat{I} = I^*) \ge 1 - \delta$, for a given sample size $n$, number of
parameters $M$ and confidence $1 - \delta$. We show that in identifiable models,
both methods can recover coefficients of size $\frac{1}{\sqrt{n}}$, up to small
multiplicative constants and logarithmic factors in $M$ and $\frac{1}{\delta}$.
The advantage of the $\ell_1 + \ell_2$ penalization over the $\ell_1$ is minor
for the variable selection problem, for the models we consider here. Whereas
the $\ell_1 + \ell_2$ estimates are unique, and become more stable for highly
correlated data matrices as one increases the tuning parameter of the $\ell_2$
part, too large an increase in this parameter value may preclude variable
selection.
1. Introduction

The literature on various theoretical aspects of $\ell_1$ empirical risk
minimization has enjoyed substantial growth over the last few years, partly as
a necessity to complement the flourishing field of convex optimization. The
main attraction, from both theoretical and computational perspectives, is the
proved ability of such methods to recover sparse approximations of the true
underlying model when the number of parameters is large relative to the sample
size. The principal theoretical topics of interest are therefore focused on
optimality properties that involve the notion of sparsity. Whereas the
theoretical properties of the $\ell_1 + \ell_2$ penalized estimates, sometimes
referred to as elastic net estimates, a phrase introduced by [23] in linear
models, have not been investigated for the models we consider, the properties
of the $\ell_1$ penalized estimates, typically referred to as Lasso-type
estimates, have received considerable attention. The topics studied range from
finite sample results concerning sparsity oracle inequalities for the risk of
the estimators, in regression and classification, e.g., [4, 5, 19, 26, 20, 2, 11],
to the asymptotic behavior of the estimates, including the consistency of
subset selection, e.g., [9, 10, 13, 22, 21, 25, 6, 3, 17, 12, 14].

This work is motivated by the emergence of a large number of variations and
improvements of the $\ell_1$ penalization schemes in regression and
classification. To appreciate the need for such variations it is important to
investigate the limitations of the original method. When the number of
variables $M$ is large relative to $n$, an asymptotic analysis of the variable
selection problem may obscure issues that arise in finite samples. In this
paper we investigate the finite sample accuracy of variable selection via the
$\ell_1$ and the closely related $\ell_1 + \ell_2$ penalization schemes in
regression models. We also discuss asymptotic alternatives and asymptotic
consequences of our results. Our goal is to review existing results, and to
offer a self-contained, back-to-back analysis of these important models and
respective penalization schemes.

Formally, let $(X_i, Y_i)$, $1 \le i \le n$, be i.i.d. pairs distributed as
$(X, Y)$ with probability measure $P$, where $Y \in \{0, 1\}$ or
$Y \in \mathbb{R}$ and $X = (X_1, \dots, X_M) \in \mathbb{R}^M$. We assume that
$E(Y | X = x) = g\big(\sum_{j \in I^*} \beta_j^* x_j\big)$, where
$I^* \subseteq \{1, \dots, M\}$ is an unknown subset and $g$ is a known link
function. In our analysis, $M$ is allowed to depend on, and be larger than, the
sample size $n$, and the size of $I^*$ may depend on $n$. The goal of this
paper is to provide an understanding of the merits and possible limitations of
variable selection via these two penalization schemes when used to answer the
following central questions: given a level of confidence $1 - \delta$, given
the number of variables $M$ and the sample size $n$, under which assumptions on
the design matrix, for which strength of the signal and for what values of the
tuning parameters do we identify the true model at the given level of
confidence? Formally, if $\widehat{I}$ is an estimate of $I^*$, we study
conditions under which $P(\widehat{I} = I^*) \ge 1 - \delta$.

We will focus on variable selection in logistic regression, corresponding to the 
link function $g(z) = e^z / (1 + e^z)$, and also present a full analysis of the problem
for linear models, corresponding to g(z) = z, to facilitate the comparison of the 
results. We will conduct separate analyses of the corresponding estimates, as 
different arguments are needed for models with possibly unbounded response, 
such as the linear model. 

We denote by $\beta^*$ the vector in $\mathbb{R}^M$ with components
$\beta_j^*$ for $j \in I^*$ and zero otherwise. We begin our analysis in
Section 2 by establishing upper bounds on the $\ell_1$ distance between the
Lasso and elastic net estimators, respectively, and the parameter $\beta^*$.
These results are connected with the sparsity oracle inequalities recently
obtained for the Lasso estimators in [4] and [2], in linear regression models,
and [19], in generalized linear regression models. The focus in these works is
on the predictive performance of the estimators, rather than on the accuracy of
variable selection, as considered here. For us, these results are an
intermediate, albeit essential, step in discussing the conditions under which
an estimate $\widehat{I}$ of the set $I^*$ satisfies
$P(\widehat{I} = I^*) \ge 1 - \delta$. It is intuitively clear that if the
estimates of $\beta^*$ are too far from $\beta^*$, we cannot hope to recover
the true coefficient set $I^*$ with high probability. It is interesting to
note, however, that under some conditions on the design matrix, we can still
estimate the true subset correctly even if the distance between $\beta^*$ and
the estimates is not close to zero, but can still be controlled as in
Section 2. Although this may appear surprising, it is this phenomenon that sets
the variable selection problem apart from the problem of estimating $\beta^*$
itself well: here we aim at identifying a non-zero coefficient. Even if the
estimate of this coefficient is relatively far from the real value, it only
matters whether it is different from zero, not whether it is very close to the
truth.

The rest of the paper is organized as follows. In Section 2.1 we revisit the
conditions on the design matrix under which sparsity oracle inequalities for
the Lasso estimates have been previously established and discuss weaker
conditions. In Sections 2.2 and 2.3 we show that these results continue to hold
under the weaker conditions. If one considers a slight modification of the
$\ell_1$ penalty that consists in the addition of a properly scaled $\ell_2$
term, one can further weaken the requirements on the design matrix, while
maintaining the sparsity of the resulting estimator. This motivates the study
of the $\ell_1 + \ell_2$ estimates, which have not been, to the best of our
knowledge, investigated theoretically from this perspective in these models.
Section 2.3 also contains an alternative asymptotic analysis of the $\ell_1$
norm of the difference $\widehat{\beta} - \beta^*$ for estimates in logistic
regression, motivated by the presence of possibly large constants in the finite
sample oracle bounds. Under weak conditions on a weighted version of the design
matrix, we obtain improved oracle bounds that hold with probability converging
to one.

In Section 3, which is central to our paper, we discuss in detail when the
Lasso and the elastic net methods can provide accurate variable selection, in
linear and logistic regression models. We show that obtaining results of the
type $P(I^* = \widehat{I}) \ge 1 - \delta$ depends crucially on a combination
of conditions on the design matrix and the signal strength. This analysis
complements the existing asymptotic results for Lasso estimates in linear
regression models, and shows that similar phenomena occur in generalized linear
models, for which the variable selection problem has not been investigated from
this perspective; we refer to the very recent work in [17] for related results
in binary graphical models. Moreover, we provide the parallel study of the
elastic net estimates, and investigate to what extent they can be used for
variable selection. We note that in a non-asymptotic framework, the study of
$P(I^* = \widehat{I})$ is well posed only if $\widehat{I}$ is unique. Since the
elastic net estimates of $\beta^*$ are unique, as shown in Appendix B, so is
the corresponding $\widehat{I}$. Recall that, in contrast, the Lasso-type
estimators of $\beta^*$ may not be unique. However, in that case, the problem
studied here is still well posed: even when the Lasso estimates of $\beta^*$
are not unique, the corresponding $\widehat{I}$ is. This property has been used
implicitly in [15], and then in [13], for linear models, without an explicit
proof, and has not been investigated outside linear models. For completeness,
we present a proof of this result in Appendix B.
In the Conclusions section we summarize our findings and discuss the relative
merits of the Lasso and elastic net estimates. The proofs of our main results are 
in Appendix A. Additional technical results are collected in Appendix B. 

1.1. Notation 

In the following sections we will denote by $\widetilde{\beta}$ the penalized
least squares estimates, for both the $\ell_1$ and $\ell_1 + \ell_2$ penalties
and, similarly, by $\widehat{\beta}$ the penalized logistic regression
estimates, for either penalty. The estimates are of course different, but we
opted for the same notation to keep the exposition simple. It will always be
clear from the context which model/penalty combination they correspond to. In
the same way, $\widehat{I}$ will always denote the set of selected variables,
and $I^*$ will denote the set of truly associated variables. We denote by $k^*$
the cardinality of $I^*$. For simplicity, we assume that the observations on
the $X$ variables are normalized and centered, that is
$\frac{1}{n}\sum_{i=1}^n X_{ij}^2 = 1$ and $\frac{1}{n}\sum_{i=1}^n X_{ij} = 0$,
for all $j$. This is in no way crucial, but it allows for cleaner results and
easier interpretation of the assumptions. We will also assume that for all $i$
and $j$ the variables $X_{ij}$ are bounded by a common constant $L > 0$, with
probability 1. For any vector $a \in \mathbb{R}^M$ we denote by
$|a|_1 = \sum_{j=1}^M |a_j|$ its $\ell_1$ norm.

2. Sparse balls for the $\ell_1$ and $\ell_1 + \ell_2$ penalized estimates

In this section we establish upper bounds on the $\ell_1$ distances
$|\widetilde{\beta} - \beta^*|_1$ and $|\widehat{\beta} - \beta^*|_1$, for the
Lasso and elastic net estimates, in linear and logistic regression,
respectively. We show that these bounds are, up to constants that we make
precise below, of the form $k^* r$, where $r$ is the tuning parameter
corresponding to the $\ell_1$ penalty and $k^*$ is the number of non-zero
components of $\beta^*$. Since the $\ell_1$ norm is a sum of $M$ terms, but the
bound only involves the unknown and possibly much lower dimension $k^*$, we
call the corresponding balls sparse.

2.1. Conditions on the design matrix 

In [5] and [19] it was shown that the Lasso-type estimates belong to sparse
$\ell_1$ balls centered at the true parameter, in linear models and generalized
linear models, respectively. These results were established under variants of a
condition on the design matrix typically referred to as the mutual coherence
condition, introduced in [8]. We state below a mild version of this condition,
which we will also use in Section 3 of this paper. Let

$$\rho_{kj} = \frac{1}{n}\sum_{i=1}^n X_{ij} X_{ik}, \quad 1 \le j, k \le M.$$

Condition Identif: We assume that there exists a constant $0 < d < 1$ such that

$$P\Big( \max_{j \in I^*,\, k \ne j} |\rho_{kj}| \le \frac{d}{k^*} \Big) = 1.$$
This condition guarantees separation of the variables in the true set $I^*$ from one
another and from the rest, where the degree of separation is measured in terms 
of the size of the correlation coefficients. We regard it here as an identifiability 
condition. It will be used as a sufficient condition for correct variable selection 
in Section 3 below. However, it is not needed for sparse oracle inequalities, as 
we detail below. 
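
To make the identifiability requirement concrete, the following Python sketch computes empirical correlations from a data matrix and the smallest constant $d$ for which a separation bound of this kind holds. It assumes, as a reading of Condition Identif suggested by the proof of Lemma 2.1 below, that the condition bounds $|\rho_{kj}|$ by $d/k^*$ for all $j \in I^*$ and $k \ne j$, with $\rho_{kj} = \frac{1}{n}\sum_i X_{ij} X_{ik}$ and centered, normalized columns. The function names are hypothetical and indices are zero-based.

```python
from typing import List, Sequence, Set


def empirical_correlations(X: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return rho[k][j] = (1/n) * sum_i X[i][j] * X[i][k].

    Assumes the columns of X are already centered and normalized,
    so that rho[j][j] = 1 and rho[k][j] is a correlation coefficient.
    """
    n = len(X)
    m = len(X[0])
    return [[sum(row[j] * row[k] for row in X) / n for j in range(m)]
            for k in range(m)]


def smallest_identif_constant(rho: Sequence[Sequence[float]],
                              true_set: Set[int]) -> float:
    """Smallest d such that |rho[k][j]| <= d / k* for all j in I*, k != j.

    This is the assumed form of the separation bound; k* = |I*|.
    """
    k_star = len(true_set)
    m = len(rho)
    worst = max(abs(rho[k][j]) for j in true_set for k in range(m) if k != j)
    return k_star * worst


def satisfies_identif(rho: Sequence[Sequence[float]],
                      true_set: Set[int], d: float) -> bool:
    """Check the assumed separation bound for a given constant 0 < d < 1."""
    return 0.0 < d < 1.0 and smallest_identif_constant(rho, true_set) <= d


# Toy design: columns 0 and 2 are identical, column 1 is orthogonal to both.
X = [[1, 1, 1], [1, -1, 1], [-1, 1, -1], [-1, -1, -1]]
rho = empirical_correlations(X)
print(satisfies_identif(rho, {1}, 0.5))  # variable 1 is well separated
print(satisfies_identif(rho, {0}, 0.5))  # variable 0 has an exact duplicate
```

In the toy design the true set $\{1\}$ is separated from the remaining variables, whereas the true set $\{0\}$ is not, since a perfectly correlated copy of the variable sits outside the set.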

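In Sections 2.2 and 2.3 below we show that Condition Identif can be relaxed
if one is only interested in prediction or in the global behavior of the
estimates measured, as in these sections, by the $\ell_1$ distance to the
truth. To formulate the weaker condition let $\alpha > 0$, $\epsilon \ge 0$ be
given. Define the set

$$V_{\alpha,\epsilon} = \Big\{ v \in \mathbb{R}^M : \sum_{j \notin I^*} |v_j| \le \alpha \sum_{j \in I^*} |v_j| + \epsilon \Big\}. \quad (2.1)$$

Let $\Sigma$ be the $M \times M$ matrix with entries $\rho_{kj}$.

Condition Stabil. Let $\alpha > 0$, $\epsilon \ge 0$ be given. There exists
$0 < b \le 1$ such that

$$P\Big( v'\Sigma v \ge b \sum_{j \in I^*} v_j^2 - \epsilon \Big) = 1, \quad \text{for any } v \in V_{\alpha,\epsilon}.$$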



Remark. We denote generically one of the estimates of $\beta^*$ studied below
by $\bar{\beta}$. We will motivate the definition of the set
$V_{\alpha,\epsilon}$ by showing, in the course of the proofs of
Theorems 2.2-2.7, that $\bar{\beta} - \beta^* \in V_{\alpha,\epsilon}$, with
high probability, for specific parameters $\alpha$ and $\epsilon$. For
instance, we will show that $\alpha$ is either 3, for the $\ell_1$ penalized
estimates, or 4, for the $\ell_1 + \ell_2$ penalized estimates. The parameter
$\epsilon$ will be either zero, for the least squares estimates, or
exponentially small, for each $M$ and $n$, in the case of the logistic
regression estimates. The term $\epsilon$ in the definition of
$V_{\alpha,\epsilon}$ is needed for purely technical reasons, and does not
affect the results or their interpretation. Condition Stabil corresponding to
$\alpha = 3$ and $\epsilon = 0$ has been introduced, for an analysis similar to
the one we conduct here, by [2], for a comparative study of the predictive
performance of the Dantzig and Lasso estimators in linear models.

One possible intuitive interpretation of Condition Stabil is as follows. If
$\epsilon = 0$, Condition Stabil is immediately implied by
$P(\Sigma - bD \ge 0) = 1$, where $D$ is the $M \times M$ matrix containing the
$k^* \times k^*$ identity matrix corresponding to indices in $I^*$, and with
zero elements otherwise. This asserts that the correlation matrix remains
positive semi-definite if we decrease the diagonal elements corresponding to
the true variables slightly, and leave all other entries unchanged. Since this
modification affects only $k^*$ of the $M^2$ entries, it can be regarded as a
stability requirement on the correlation structure. Condition Stabil is even
milder than $P(\Sigma - bD \ge 0) = 1$, since it is only required to hold for
$v \in V_{\alpha,\epsilon}$, for some given $\alpha$ and $\epsilon$.

The following lemma establishes the relationship between the two conditions,
and shows that Condition Stabil is less restrictive. A brief argument
establishing this link is also offered in [2], for $\alpha = 3$ and
$\epsilon = 0$; we include a full proof here for the general case, for
completeness.

Lemma 2.1. Let $\alpha > 0$ and $\epsilon \ge 0$ be given. If Condition Identif
holds for some $0 < d < 1/(1 + 2\alpha + \epsilon)$, then Condition Stabil
holds for any $0 < b \le 1 - d(1 + 2\alpha + \epsilon)$.

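Proof. Let $\Sigma^*$ be the $k^* \times k^*$ matrix with entries $\rho_{kj}$,
$k, j \in I^*$. For any $v \in \mathbb{R}^M$ denote by $v_*$ the vector in
$\mathbb{R}^{k^*}$ obtained from $v$ by retaining only the components
corresponding to $I^*$. Then, for $v \in V_{\alpha,\epsilon}$,

$$
\begin{aligned}
v'\Sigma v &\ge v_*'\Sigma^* v_* - 2\sum_{j \in I^*}\sum_{k \notin I^*} |\rho_{kj}|\,|v_j|\,|v_k| \\
&\ge v_*'\Sigma^* v_* - \frac{2d}{k^*}\sum_{j \in I^*}|v_j| \sum_{k \notin I^*}|v_k|, \quad \text{under Condition Identif} \\
&\ge v_*'\Sigma^* v_* - \frac{2\alpha d}{k^*}\Big(\sum_{j \in I^*}|v_j|\Big)^2 - \frac{2d\epsilon}{k^*}\sum_{j \in I^*}|v_j|, \quad \text{since } v \in V_{\alpha,\epsilon} \\
&\ge v_*'\Sigma^* v_* - 2\alpha d \sum_{j \in I^*} v_j^2 - \frac{2d\epsilon}{\sqrt{k^*}}\Big(\sum_{j \in I^*} v_j^2\Big)^{1/2}, \quad \text{by Cauchy-Schwarz} \\
&\ge v_*'\Sigma^* v_* - (2\alpha + \epsilon)\, d \sum_{j \in I^*} v_j^2 - d\epsilon, \quad \text{since } 2xy \le x^2 + y^2 \\
&\ge \big(1 - d(1 + 2\alpha + \epsilon)\big) \sum_{j \in I^*} v_j^2 - \epsilon.
\end{aligned}
$$

The last inequality follows from Condition Identif, which also implies that
$v_*'\Sigma^* v_* \ge (1 - d)\sum_{j \in I^*} v_j^2$, and so Condition Stabil
holds for any $b$ with $0 < b \le 1 - d(1 + 2\alpha + \epsilon)$. □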

Thus, for instance, for the study of the Lasso estimates in linear models, we
have $\alpha = 3$ and $\epsilon = 0$, and so if Condition Identif holds for
some $d$, then Condition Stabil holds for $0 < b \le 1 - 7d$, which imposes the
restriction $0 < d < \frac{1}{7}$.
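
The arithmetic of Lemma 2.1 can be illustrated with a short Python sketch. It only evaluates the bound $1 - d(1 + 2\alpha + \epsilon)$ stated in the lemma and rejects values of $d$ outside the range the lemma covers; the function name is hypothetical.

```python
def stabil_constant_upper_bound(d: float, alpha: float, eps: float = 0.0) -> float:
    """Largest b allowed by Lemma 2.1: b <= 1 - d * (1 + 2*alpha + eps).

    Raises ValueError when d is outside (0, 1 / (1 + 2*alpha + eps)),
    the range of d for which the lemma applies.
    """
    factor = 1.0 + 2.0 * alpha + eps
    if not 0.0 < d < 1.0 / factor:
        raise ValueError("Lemma 2.1 requires 0 < d < 1 / (1 + 2*alpha + eps)")
    return 1.0 - d * factor


# Lasso in linear models: alpha = 3, eps = 0, so b <= 1 - 7d.
print(stabil_constant_upper_bound(0.1, alpha=3.0))
# Elastic net in linear models: alpha = 4, eps = 0, so b <= 1 - 9d.
print(stabil_constant_upper_bound(0.1, alpha=4.0))
```

With $d = 0.1$ the admissible $b$ shrinks from about $0.3$ for $\alpha = 3$ to about $0.1$ for $\alpha = 4$, which shows how quickly the guaranteed stability constant degrades as $d$ approaches the limit of the admissible range.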

The results of Sections 2.2 and 2.3 below will be established directly under
the less restrictive Condition Stabil, which requires the specification of a
constant $b$. Notice that if $b$ is very small, the condition is almost a
tautology, as $\Sigma \ge 0$ by construction. However, as will become apparent
from the results established below, a very small value of $b$ will increase the
radius of the $\ell_1$ balls covering the estimator. This motivates the
parallel study of the elastic net estimates. We show that they are less
affected by potentially small values of $b$.



2.2. Sparse $\ell_1$ balls for estimates in linear regression models

Throughout all sections on linear regression in this paper we assume that the
model generating the data is $E(Y | X = x) = \sum_{j \in I^*} \beta_j^* x_j$,
for $x \in \mathbb{R}^M$ and $I^* \subseteq \{1, \dots, M\}$. This is the most
popular model for regression with unbounded response $Y$. It is also becoming
increasingly common in regression models with $Y \in \{0, 1\}$, when the data
supports it. Its usage in this context dates back to [1].

2.2.1. An $\ell_1$ penalized least squares estimator

We estimate $\beta^*$ by

$$\widetilde{\beta} = \arg\min_{\beta \in \mathbb{R}^M} \frac{1}{n}\sum_{i=1}^n (Y_i - \beta'X_i)^2 + 2r\sum_{j=1}^M |\beta_j|, \quad (2.2)$$

where $r =: r_{n,M}(\delta)$ is a tuning sequence depending on $n$, $M$ and a
user specified parameter $\delta$. In what follows we determine $r$ such that
$P(|\widetilde{\beta} - \beta^*|_1 \le C r k^*) \ge 1 - \delta$, and we make
$C > 0$ precise.

In the following theorem we will use Condition Stabil corresponding to the
set $V_{\alpha,\epsilon}$ defined in (2.1) for $\epsilon = 0$ and $\alpha = 3$.
Let $\sigma^2 = \mathrm{Var}(Y)$ and recall that $L$ denotes a common bound on
$X_j$, $1 \le j \le M$.

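Theorem 2.2. Assume that Condition Stabil is satisfied for some $0 < b \le 1$.
If we choose

$$r \ge 2\sqrt{\frac{2\ln\frac{2M}{\delta}}{n}}, \quad \text{if } Y \in \{0, 1\},$$

or

$$r \ge 4L\sigma\sqrt{\frac{\ln\frac{4M}{\delta}}{n}} \vee 8L\,\frac{\ln\frac{4M}{\delta}}{n}, \quad \text{if } Y \in \mathbb{R},$$

then

$$P\Big( |\widetilde{\beta} - \beta^*|_1 \le \frac{4}{b}\, r k^* \Big) \ge 1 - \delta$$

for $\widetilde{\beta}$ given in (2.2).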

Remark 1. In practice one can replace $\sigma$ in the tuning sequence by an
estimator, as discussed in detail in [4].

Remark 2. It is interesting to note that although the results above indicate
that the radius of the $\ell_1$ ball is small if $k^* r < 1$, the proofs make
no use of this restriction on $k^*$; in particular $k^* > \sqrt{n}$ is allowed.
It is clear that in this case the bounds are large but, perhaps surprisingly,
this does not affect the validity of variable selection, for some design
matrices. We discuss this in detail in the next section.
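
The size of the tuning sequence is easy to evaluate numerically. The Python sketch below assumes the lower bounds on $r$ read $2\sqrt{2\ln(2M/\delta)/n}$ for $Y \in \{0,1\}$ and $4L\sigma\sqrt{\ln(4M/\delta)/n} \vee 8L\ln(4M/\delta)/n$ for $Y \in \mathbb{R}$; the function names are hypothetical, and $\sigma$ and $L$ must be supplied by the user.

```python
import math


def lasso_tuning_binary(n: int, M: int, delta: float) -> float:
    """Smallest r allowed for Y in {0, 1}: 2 * sqrt(2 * ln(2M/delta) / n)."""
    return 2.0 * math.sqrt(2.0 * math.log(2.0 * M / delta) / n)


def lasso_tuning_real(n: int, M: int, delta: float, L: float, sigma: float) -> float:
    """Smallest r allowed for real-valued Y (assumed form of the bound):
    max(4*L*sigma*sqrt(ln(4M/delta)/n), 8*L*ln(4M/delta)/n)."""
    log_term = math.log(4.0 * M / delta)
    return max(4.0 * L * sigma * math.sqrt(log_term / n),
               8.0 * L * log_term / n)


# The bound decays like n^{-1/2} and grows only logarithmically in M and 1/delta.
for n in (100, 1000, 10000):
    print(n, round(lasso_tuning_binary(n, M=50, delta=0.05), 4))
```

The loop makes the scaling visible: multiplying $n$ by 100 divides the bound by 10, while increasing $M$ from 50 to 5000 changes it only through the logarithm.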



Theorem 2.2 above shows that the bound on $|\widetilde{\beta} - \beta^*|_1$
becomes large if Condition Stabil is satisfied only for very small values of
$b$. One remedy is provided by a slightly modified estimator, which retains the
sparsity properties of the Lasso estimates, but is less affected by small
values of $b$. The modified estimate will be penalized least squares with a
combined $\ell_1$ and $\ell_2$ penalty and we discuss it in the next
subsection.
2.2.2. An $\ell_1 + \ell_2$ penalized least squares estimator

We now estimate $\beta^*$ by $\widetilde{\beta}$, where

$$\widetilde{\beta} = \arg\min_{\beta \in \mathbb{R}^M} \frac{1}{n}\sum_{i=1}^n (Y_i - \beta'X_i)^2 + 2r\sum_{j=1}^M |\beta_j| + c\sum_{j=1}^M \beta_j^2. \quad (2.3)$$

As before, the goal is to find $r =: r_{n,M}(\delta)$ and
$c =: c_{n,M}(\delta)$ for which we can construct sparse balls for the
estimates.

In the following theorem we will use Condition Stabil corresponding to the
set $V_{\alpha,\epsilon}$ defined in (2.1) for $\epsilon = 0$ and $\alpha = 4$.

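Theorem 2.3. Assume that Condition Stabil is satisfied for some $0 < b \le 1$.
If $\max_{j \in I^*} |\beta_j^*| \le B$, for some $B > 0$, independent of $n$,
and if

$$r \ge 2\sqrt{\frac{2\ln\frac{2M}{\delta}}{n}}, \quad c = \frac{r}{2B}, \quad \text{if } Y \in \{0, 1\},$$

or

$$r \ge 4L\sigma\sqrt{\frac{\ln\frac{4M}{\delta}}{n}} \vee 8L\,\frac{\ln\frac{4M}{\delta}}{n}, \quad c = \frac{r}{2B}, \quad \text{if } Y \in \mathbb{R},$$

then

$$P\Big( |\widetilde{\beta} - \beta^*|_1 \le \frac{4.25}{b + c}\, r k^* \Big) \ge 1 - \delta$$

for $\widetilde{\beta}$ given in (2.3).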

Remark. The result above shows that even if Condition Stabil holds with $b$
very close to 0, the bound on $|\widetilde{\beta} - \beta^*|_1$ stays finite,
for any given $M$ and $n$. Note that it may still be large, since $c$ is
restricted to take relatively small values, dictated by the sizes of $r$ and
$B$. However, we cannot choose a much larger value for $c$: in that case the
$\ell_2$ penalty would become prevalent, and no estimates will be set to zero
in finite samples.
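
For readers who wish to experiment with the two least squares criteria, the following Python sketch minimizes $\frac{1}{n}\sum_i (Y_i - \beta'X_i)^2 + 2r\sum_j|\beta_j| + c\sum_j \beta_j^2$ by cyclic coordinate descent with soft thresholding. Coordinate descent is our choice for illustration and is not a method discussed in this paper; the function names, the number of sweeps and the tolerance are hypothetical. Setting $c = 0$ gives the $\ell_1$ penalized criterion.

```python
from typing import List, Sequence


def soft_threshold(z: float, t: float) -> float:
    """Shrink z towards zero by t, returning exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def penalized_least_squares(X: Sequence[Sequence[float]], y: Sequence[float],
                            r: float, c: float = 0.0,
                            sweeps: int = 200, tol: float = 1e-10) -> List[float]:
    """Cyclic coordinate descent for
        (1/n) * sum_i (y_i - beta'x_i)^2 + 2*r*sum_j |beta_j| + c*sum_j beta_j^2.
    Assumes no column of X is identically zero when c = 0.
    """
    n, m = len(X), len(X[0])
    beta = [0.0] * m
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(m)]
    residual = list(y)  # equals y - X beta, with beta = 0 initially
    for _ in range(sweeps):
        max_change = 0.0
        for j in range(m):
            # (1/n) * <column j, partial residual that excludes coordinate j>
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * beta[j]
            # Exact minimizer of the one-dimensional problem in beta_j.
            new_value = soft_threshold(rho, r) / (col_sq[j] + c)
            change = new_value - beta[j]
            if change != 0.0:
                for i in range(n):
                    residual[i] -= change * X[i][j]
                beta[j] = new_value
                max_change = max(max_change, abs(change))
        if max_change < tol:
            break
    return beta


# Orthogonal, normalized toy design; y = 3*x_1 + 0.1*x_2 without noise.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [3.1, 2.9, -2.9, -3.1]
print(penalized_least_squares(X, y, r=0.5))           # l1 penalty only
print(penalized_least_squares(X, y, r=0.5, c=0.25))   # l1 + l2 penalty
```

In this toy design both criteria set the weak second coefficient exactly to zero, and the $\ell_2$ term only shrinks the surviving coefficient further. The division by `col_sq[j] + c` also shows why a large $c$ changes the size of the non-zero estimates without helping selection.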

2.3. Sparse $\ell_1$ balls for estimates in logistic regression models

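We denote the logistic loss function by

$$l(\beta) =: l(\beta; x, y) =: -y\beta'x + \log(1 + \exp(\beta'x)),$$

and denote by $Pl(\beta) = E\, l(\beta; X, Y)$ the associated risk. Define

$$\beta^* = \arg\min_{\beta} Pl(\beta).$$

Throughout all sections on logistic regression we will assume that

$$p(x) =: P(Y = 1 | X = x) = \frac{\exp\big(\sum_{j \in I^*} \beta_j^* x_j\big)}{1 + \exp\big(\sum_{j \in I^*} \beta_j^* x_j\big)}. \quad (2.4)$$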
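
As a small numerical companion to these definitions, the Python sketch below evaluates the logistic loss $-y\beta'x + \log(1 + \exp(\beta'x))$, its empirical average over a sample, and the success probability $e^z/(1 + e^z)$ at $z = \beta'x$. The function names are hypothetical, and the rearrangement used to avoid overflow for large $|\beta'x|$ is an implementation choice, not part of the paper.

```python
import math
from typing import Sequence


def linear_predictor(beta: Sequence[float], x: Sequence[float]) -> float:
    """Inner product beta'x."""
    return sum(b * xj for b, xj in zip(beta, x))


def logistic_loss(beta: Sequence[float], x: Sequence[float], y: int) -> float:
    """l(beta; x, y) = -y * beta'x + log(1 + exp(beta'x)), computed stably."""
    z = linear_predictor(beta, x)
    # log(1 + exp(z)) = max(z, 0) + log(1 + exp(-|z|)) avoids overflow.
    log1pexp = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return -y * z + log1pexp


def empirical_risk(beta: Sequence[float], X: Sequence[Sequence[float]],
                   Y: Sequence[int]) -> float:
    """Sample average of the logistic loss."""
    return sum(logistic_loss(beta, x, y) for x, y in zip(X, Y)) / len(X)


def success_probability(beta: Sequence[float], x: Sequence[float]) -> float:
    """g(beta'x) with g(z) = exp(z) / (1 + exp(z)), computed stably."""
    z = linear_predictor(beta, x)
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# At beta = 0 every observation has loss log 2 and success probability 1/2.
print(logistic_loss([0.0, 0.0], [1.0, 2.0], 1), math.log(2.0))
print(success_probability([0.0, 0.0], [1.0, 2.0]))
```

The value $\log 2$ at $\beta = 0$ is a convenient baseline: a fitted coefficient vector is informative only if its empirical risk falls below it.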
2.3.1. An $\ell_1$ penalized logistic regression estimator

We estimate $\beta^*$ by

$$\widehat{\beta} = \arg\min_{\beta \in \mathbb{R}^M} \frac{1}{n}\sum_{i=1}^n \Big\{ -Y_i \sum_{j=1}^M \beta_j X_{ij} + \log\Big(1 + \exp\sum_{j=1}^M \beta_j X_{ij}\Big) \Big\} + 2r\sum_{j=1}^M |\beta_j|. \quad (2.5)$$

We will determine the tuning sequence $r = r_{n,M}(\delta)$, different from the
one above, for which we can construct sparse balls for these estimators. We
will analyze the estimates under the assumption that $p(x)$ is bounded away
from zero and one for all $x$. This is implied by:

Assumption A: There exists $D > 0$ such that $\sum_{j \in I^*} |\beta_j^*| \le D$,

in combination with the assumption that all $X$ variables are bounded by $L$.

In the following theorem we will use Condition Stabil corresponding to the
set $V_{\alpha,\epsilon}$ defined in (2.1) for

$$\epsilon = \frac{\log 2}{2^{(M \vee n) + 1}} \times \frac{1}{r},$$

with $r$ given below, and for $\alpha = 3$. Also, let $s$ be a constant
depending on $L$ and $D$, which decreases with $D$.

Theorem 2.4. Assume Condition Stabil is satisfied for some $0 < b \le 1$. If
Assumption A holds and if



n 4(M V n) ' 

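then

$$P\Big( |\widehat{\beta} - \beta^*|_1 \le \frac{4}{bs}\, r k^* + \Big(1 + \frac{1}{r}\Big)\epsilon \Big) \ge 1 - \delta, \quad (2.6)$$

for $\widehat{\beta}$ given by (2.5).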

Remark. Notice that the term $(1 + \frac{1}{r})\epsilon$ is roughly $2^{-(M \vee n)}$ and therefore negligible.

As noted above, the bound on $|\widehat{\beta} - \beta^*|_1$ becomes large for
very small values of $b$, which motivates the study of the $\ell_1 + \ell_2$
penalized estimators in the next section. Also, the constant $1/s$, the exact
form of which is given in the course of the proof of this theorem, can be very
large for large $D$; similar results, based on different arguments and slightly
more restrictive assumptions on $\Sigma$, have also been obtained in [19].
However, if we content ourselves with asymptotic statements, we can obtain an
improved bound on $|\widehat{\beta} - \beta^*|_1$, with $1/s$ replaced by a
quantity arbitrarily close to 1. These results will hold with probability
converging to 1, and under more stringent requirements on the design. They are
based on the following fact, which is of independent interest: it establishes
the sup-norm consistency of $\widehat{\beta}'x$ as an estimate of $\beta^{*\prime}x$.
Proposition 2.5. Let $\delta =: \delta_n$ be any sequence converging to zero
with $n$. If $\max_{j \in I^*} |\beta_j^*| \le B$, for some $B > 0$,
independent of $n$, and $k^* r \to 0$, then for any $\eta > 0$ we have

$$P\Big( \sup_{x} |\widehat{\beta}'x - \beta^{*\prime}x| \ge \eta \Big) \to 0,$$

as $n \to \infty$.



Remark. It is interesting to note that this result holds independently of the
assumptions on the design.

The next result is obtained under a condition similar to Condition Stabil, but
required to hold for a weighted version of the matrix $\Sigma$. Let
$g(z) = e^z / (1 + e^z)$. Let $\Sigma_1$ be the $M \times M$ matrix with
entries $\frac{1}{n}\sum_{i=1}^n g'(\beta^{*\prime}X_i)\, X_{ik} X_{ij}$,
$1 \le j, k \le M$.

Condition LStabil. Let $\alpha > 0$, $\epsilon \ge 0$ be given. There exists
$0 < b \le 1$ such that

$$P\Big( v'\Sigma_1 v \ge b \sum_{j \in I^*} v_j^2 - \epsilon \Big) = 1, \quad \text{for any } v \in V_{\alpha,\epsilon}.$$

Theorem 2.6. Assume Condition LStabil is satisfied for some $0 < b \le 1$. Let



> (6 + + 2i J^i) . 1 



?i 4(M V n) ' 

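for any sequence $\delta_n \to 0$. If $\max_{j \in I^*} |\beta_j^*| \le B$, for
some $B > 0$ independent of $n$, and $k^* r \to 0$, then

$$P\Big( |\widehat{\beta} - \beta^*|_1 \le \frac{4}{wb}\, r k^* + \Big(1 + \frac{1}{r}\Big)\epsilon \Big) \to 1 \quad (2.7)$$

for $\widehat{\beta}$ given by (2.5) and for a constant $w$ arbitrarily close
to one.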



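2.3.2. An $\ell_1 + \ell_2$ penalized logistic regression estimator

In this section we obtain similar results for estimators of $\beta^*$ given by

$$\widehat{\beta} = \arg\min_{\beta \in \mathbb{R}^M} \frac{1}{n}\sum_{i=1}^n \Big\{ -Y_i \sum_{j=1}^M \beta_j X_{ij} + \log\Big(1 + \exp\sum_{j=1}^M \beta_j X_{ij}\Big) \Big\} + 2r\sum_{j=1}^M |\beta_j| + c\sum_{j=1}^M \beta_j^2. \quad (2.8)$$

In the following theorem we will use Condition Stabil corresponding to the set
$V_{\alpha,\epsilon}$ defined in (2.1) for

$$\epsilon = \frac{\log 2}{2^{(M \vee n) + 1}} \times \frac{1}{r},$$

for $r$ given below, and $\alpha = 4$. Let $s > 0$ be the constant given in
Theorem 2.4 above.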
Theorem 2.7. Assume that Assumption A holds and that Condition Stabil is
satisfied for some $0 < b \le 1$. Let $B > 0$ be such that
$\max_{j \in I^*} |\beta_j^*| \le B$ and take



r > (6 + 4W 2l0g2 ( MV ") + * , + 



r 



4(MVn) V « ' 25 ' 



Then

p(V-rii<^^ + (i + -H>i 



r 



for $\widehat{\beta}$ given by (2.8).

The comments and remarks of Section 2.2.1 apply here with no change: the
$\ell_1 + \ell_2$ penalized estimates are more stable, in that the radius of
the $\ell_1$ ball covering the estimate is less affected by small values of $b$
and $s$. However, care must be taken not to choose $c$ too large, as in that
case the sparsity properties will be lost. We can also derive, in a similar
manner, versions of Proposition 2.5 and Theorem 2.6 for the $\ell_1 + \ell_2$
penalized estimate. Since the results are nearly identical we do not include
them here.



3. Correct subset selection 

The asymptotic properties of subset selection via the Lasso in linear models
have been studied recently by a number of authors: [13] studied selection in
Gaussian graphical models, [21] investigated subset selection in linear
regression for what were termed incoherent design matrices, [3] studied
approximating regression models under design matrices satisfying Condition
Identif introduced in Section 2 above and previously discussed in [4], and [25]
investigated a three stage procedure in linear models. A nice overview of the
connections between incoherent design matrices and matrices satisfying
conditions similar to our Condition Identif is given in [14]. An interesting
asymptotic analysis, in which one studies the interplay between the sample size
$n$, the sparsity level $k^*$ and the number of variables $M$ for average
asymptotic consistency in linear regression models with Gaussian design, is
presented in [24]. There the coefficient set $I^*$ is assumed to have been
selected uniformly at random from $\{1, \dots, M\}$, and one studies
asymptotically the average error probability, where one averages over all
possible choices of $I^*$. We refer to the work of [6] for a non-asymptotic
investigation of the accuracy of model selection via the Lasso in linear
models, but under model assumptions different from ours: the coefficient set
$I^*$ is again assumed to have been selected uniformly at random from
$\{1, \dots, M\}$, and conditionally on $I^*$ the signs of $\beta_j^*$,
$j \in I^*$, are assumed to be equally likely to be 1 or -1. The properties of
the Lasso-type estimates used for correct subset selection in logistic
regression have not been investigated from the perspective considered here.
After finishing this paper, we learned of the recent work of [17], which
investigates the closely related topic of asymptotic model selection
consistency in binary graphs; we comment on connections with our work in
Section 3.2.2. The finite sample properties of variable selection via the
elastic net have not been investigated in either of the models considered here.
For a discussion of its usage in linear regression models with different target
parameters than those considered here we refer to [23].

We study in this section the non-asymptotic merits of the Lasso and elastic 
net estimates when used for variable selection. We conclude the section with the 
asymptotic implications of these results. 

All estimates of $\beta^*$ analyzed in Theorems 2.2-2.7 have coefficients that
are exactly zero. These theorems, however, do not necessarily guarantee that
the corresponding set of non-zero coefficients of these estimates is exactly
equal to $I^*$, with high probability: we can either omit some of the true
variables or include variables that do not belong to $I^*$ while still being
able to control the radii of the $\ell_1$ balls. In this section we find
estimates of $\beta^*$ that have the properties discussed in Section 2 and for
which, in addition, we have $P(I^* = \widehat{I}) \ge 1 - \gamma$, for some
given small $\gamma > 0$. Since
$P(I^* = \widehat{I}) \ge 1 - P(I^* \not\subseteq \widehat{I}) - P(\widehat{I} \not\subseteq I^*)$,
we find the subset $\widehat{I}$ such that

$$P(I^* \subseteq \widehat{I}) \ge 1 - \gamma_1 \quad \text{and} \quad P(I^* \supseteq \widehat{I}) \ge 1 - \gamma_2,$$

with $\gamma_1 + \gamma_2 = \gamma$.

3.1. Correct inclusion of all true variables in the selected set 

In this section we discuss conditions under which we can obtain results of the
type

$$P(I^* \subseteq \widehat{I}) \ge 1 - \gamma_1,$$

for some given $\gamma_1 > 0$, for estimates having the properties discussed in
Section 2 above. Lemma 3.1 below shows what governs the size of
$P(I^* \subseteq \widehat{I})$. We discuss in detail to what extent we can use
the results of Section 2 directly for this study. Recall that the cardinality
of $I^*$ is $k^*$.

Lemma 3.1. Let $\beta^*$ and $\widehat{\beta}$ be a parameter/estimator
combination from Section 2. Let $\widehat{I}$ be the index set of the non-zero
components of $\widehat{\beta}$. Then

$$P(I^* \not\subseteq \widehat{I}) \le P\Big( |\widehat{\beta} - \beta^*|_1 \ge \min_{j \in I^*} |\beta_j^*| \Big). \quad (3.1)$$
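Proof. The following display follows directly from the definitions of
$\widehat{I}$ and $I^*$.

$$
\begin{aligned}
P(I^* \not\subseteq \widehat{I}) &\le P(j \notin \widehat{I} \text{ for some } j \in I^*) \\
&\le P(\widehat{\beta}_j = 0 \text{ and } \beta_j^* \ne 0, \text{ for some } j \in I^*) \\
&\le P(|\widehat{\beta}_j - \beta_j^*| = |\beta_j^*|, \text{ for some } j \in I^*) \\
&\le P\Big(|\widehat{\beta}_j - \beta_j^*| \ge \min_{k \in I^*} |\beta_k^*|, \text{ for some } j \in I^*\Big) \\
&\le P\Big(|\widehat{\beta} - \beta^*|_1 \ge \min_{j \in I^*} |\beta_j^*|\Big).
\end{aligned}
$$

□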
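
The deterministic content of Lemma 3.1 can be checked on small vectors. The Python sketch below, with hypothetical function names, extracts the index set of non-zero components and tests the sufficient condition $|\widehat{\beta} - \beta^*|_1 < \min_{j \in I^*}|\beta_j^*|$, which by the chain of inequalities above forces $I^* \subseteq \widehat{I}$. It says nothing about the reverse inclusion.

```python
from typing import Sequence, Set


def support(beta: Sequence[float]) -> Set[int]:
    """Index set (zero-based) of the non-zero components of beta."""
    return {j for j, b in enumerate(beta) if b != 0.0}


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """|a - b|_1, the sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def inclusion_guaranteed(beta_hat: Sequence[float],
                         beta_star: Sequence[float]) -> bool:
    """True when |beta_hat - beta*|_1 < min over I* of |beta*_j|,
    in which case every true variable must have a non-zero estimate."""
    true_set = support(beta_star)
    min_signal = min(abs(beta_star[j]) for j in true_set)
    return l1_distance(beta_hat, beta_star) < min_signal


beta_star = [2.0, 0.0, -1.0]
close_fit = [1.7, 0.1, -0.6]   # l1 error 0.8, below the smallest signal 1.0
poor_fit = [1.5, 0.0, 0.0]     # l1 error 1.5, the third variable is dropped

print(inclusion_guaranteed(close_fit, beta_star), support(close_fit))
print(inclusion_guaranteed(poor_fit, beta_star), support(poor_fit))
```

The first estimate contains both true variables but also a spurious one, so the condition controls omissions only; the second estimate violates the condition and indeed omits a true variable.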



F. Bunea / Honest variable selection



3.1.1. Detection of large signals

The purpose of this subsection is to point out that the study of
$P(I^* \subseteq \widehat{I})$ via a direct application of the sparse oracle
bounds derived in the previous section may lead to suboptimal results. We argue
this in what follows. Inequality (4.15) in the proof of Lemma 3.1 above makes
it clear that the rate at which $P(I^* \not\subseteq \widehat{I})$ decays is
governed by how small we can make the probability of estimating a non-zero
component of $\beta^*$ by zero. However, if we further bound this probability
and arrive at (3.1), we can use Theorems 2.2-2.7 of the previous section
directly. We thus arrive at the following corollary.

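Corollary 3.2. Let $0 < \delta < 1$ be fixed. Assume Condition Stabil holds,
for the parameters specified in the statements of Theorems 2.2-2.7,
respectively.

1. $\ell_1$ penalized least squares in linear models:

If $\min_{j \in I^*} |\beta_j^*| > \frac{4}{b} r k^*$ with $r$ given by
Theorem 2.2, then $P(I^* \subseteq \widehat{I}) \ge 1 - \delta$.

2. $\ell_1 + \ell_2$ penalized least squares in linear models:

If $\frac{4.25}{b + c} r k^* < \min_{j \in I^*} |\beta_j^*| \le \max_{j \in I^*} |\beta_j^*| \le B$,
for some $B > 0$, and with $r$ and $c$ given by Theorem 2.3, then
$P(I^* \subseteq \widehat{I}) \ge 1 - \delta$.

3. $\ell_1$ penalized logistic regression:

If $\min_{j \in I^*} |\beta_j^*| > \frac{4}{bs} r k^* + (1 + \frac{1}{r})\epsilon$,
with $s$, $r$ and $\epsilon$ given by Theorem 2.4 and if Assumption A holds,
then $P(I^* \subseteq \widehat{I}) \ge 1 - \delta$.

4. $\ell_1 + \ell_2$ penalized logistic regression:

If $\min_{j \in I^*} |\beta_j^*|$ exceeds the bound on
$|\widehat{\beta} - \beta^*|_1$ established in Theorem 2.7 and
$\max_{j \in I^*} |\beta_j^*| \le B$, with $r$, $c$ and $\epsilon$ given by
Theorem 2.7, and if Assumption A holds, then
$P(I^* \subseteq \widehat{I}) \ge 1 - \delta$.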

Remark. The lower bounds on the minimum size of the true coefficients stated
in Corollary 3.2 are all of the type

$$\min_{j \in I^*} |\beta_j^*| \ge C r k^*, \quad (3.2)$$

possibly up to the small additive term $\epsilon$ defined in the previous
section.

For stable design matrices, when the constant $C$ is close to 1, and if the
true model is supported on a space of dimension $k^*$, with very low $k^*$
satisfying $r k^* < 1$, then such lower bounds imply that we can detect
moderate sized signals. Clearly, for large $k^*$, the lower bounds on the
coefficient size are too conservative, especially since the constant $C$ may
also be large. We discuss below when one can weaken this requirement.
3.1.2. Detection of weaker signals

Propositions 3.3 and 3.4 below show that the lower bounds on the signal
strength can be significantly weakened under further conditions on the design
matrix. The intuition is the following: if the signal is very weak and the true
variables are highly correlated with one another and with the rest, one cannot
hope to recover the true model with high probability. We will therefore work,
for the remainder of this paper, under the assumption that the true model is
identifiable, as quantified in Condition Identif stated in Section 2 above.
Recall that this condition only requires that the true variables be separated
from one another and from the rest, and it does not impose any restrictions on
the variables placed outside the true set.

Detection of weak signals via $\ell_1$ and $\ell_1 + \ell_2$ penalized least squares in
linear models.

We show below that if the identifiability condition is met, then we can recover
coefficients with sizes above the noise level $n^{-1/2}$. The following result shows
that, if the identification is to be performed at some given confidence level $\delta$,
the size of the signal will also depend on $\delta$. Moreover, it will depend on $M$,
via a logarithmic term: this is the price to pay for simultaneous identification
of the true variables, among all $M$ possibilities. In what follows we will use
the following tuning parameters, depending on whether $Y \in \{0, 1\}$ or $Y \in \mathbb{R}$. Let
$0 < \delta < 1$ be fixed. Let $K$ be an upper bound on $k^*$. Since $k^*$ is unknown, one
can always use the conservative bound $M$. However, if in practical situations a smaller $K$
is known, one can use it instead of the larger bound $M$. Consider



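$$r \ge 2\sqrt{\frac{2\ln(2KM/\delta)}{n}}, \quad \text{if } Y \in \{0, 1\}, \qquad (3.3)$$

or

$$r \ge 4L\sigma\sqrt{\frac{\ln(4KM/\delta)}{n}} \vee \frac{8L}{n}\ln\frac{4KM}{\delta}, \quad \text{if } Y \in \mathbb{R}.$$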
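As a rough illustration of how the tuning parameter for a binary response scales, the sketch below evaluates the smallest value allowed by the first line of (3.3), read here as $r = 2\sqrt{2\ln(2KM/\delta)/n}$. The function name and the sample values of $n$, $M$, $K$ and $\delta$ are hypothetical and chosen only for illustration.

```python
import math


def tuning_parameter_binary(n, M, K, delta):
    """Smallest r allowed by the binary-response case of (3.3).

    Assumes the reading r = 2 * sqrt(2 * ln(2KM/delta) / n); K is an
    upper bound on the unknown sparsity k*, with K = M the conservative choice.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if not 1 <= K <= M:
        raise ValueError("K must satisfy 1 <= K <= M")
    return 2.0 * math.sqrt(2.0 * math.log(2.0 * K * M / delta) / n)


# Conservative choice K = M versus a sharper prior bound on k*.
r_conservative = tuning_parameter_binary(n=1000, M=200, K=200, delta=0.05)
r_sharper = tuning_parameter_binary(n=1000, M=200, K=10, delta=0.05)
print(r_conservative, r_sharper)
```

Because $K$ enters only through a logarithm, replacing the conservative bound $K = M$ by a much smaller value lowers $r$, and hence the detectable signal size $2r$, only modestly.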

Proposition 3.3. For $r$ given above we assume that

$$\min_{j \in I^*} |\beta^*_j| \ge 2r.$$

(1) If Condition Identif is satisfied for $d \le \frac{1}{15}$ and $\hat{I}$ corresponds to the $\ell_1$
penalized least squares estimate, then

$$\mathbb{P}(I^* \subseteq \hat{I}) \ge 1 - \delta - \frac{\delta}{M}.$$

(2) We assume, in addition, that $\max_{j \in I^*} |\beta^*_j| \le B$ for some $B > 0$. We choose
$c = \frac{r}{2B}$. If Condition Identif is satisfied for $d \le \frac{1+c}{17.5}$ and $\hat{I}$ corresponds to the
$\ell_1 + \ell_2$ penalized least squares estimate, then

$$\mathbb{P}(I^* \subseteq \hat{I}) \ge 1 - \delta - \frac{\delta}{M}.$$

Remark. Notice that although Corollary 3.2 is established under the weaker
Condition Stabil, it only guarantees that $\mathbb{P}(I^* \subseteq \hat{I}) \ge 1 - \delta$ for those sets $I^*$ for
which $\min_{j \in I^*} |\beta^*_j| > \frac{4}{b} r k^*$. In contrast, if Condition Identif holds, we can
detect variables corresponding to the set $I^*$ for which $\min_{j \in I^*} |\beta^*_j| \ge 2r$. This
is a substantial relaxation of the lower bound on the signal strength, which no
longer depends on either the possibly large $k^*$ or the possibly small $b$. Similar
relaxations of the requirements on $\min_{j \in I^*} |\beta^*_j|$ have also been obtained by [24]
and [6], but for models in which $I^*$ is assumed to be random, as discussed at
the beginning of Section 3.

Proposition 3.3 above allows an immediate comparison between the selection
properties of the Lasso and the elastic net. Their behavior is almost the same;
the only difference is in the restriction on the constant $d$: slightly larger values
of $d$ can be allowed for the elastic net estimate. This translates into saying that
if the correlations between the true variables, and between the true variables
and the rest, are slightly larger than what is allowed for the Lasso, then the
$\ell_1 + \ell_2$ penalized estimate may provide an alternative. However, as we noted
in Section 2, although it would be tempting to increase the value of $c$ in order
to allow for a larger degree of correlation, this would result in not setting any
components of the estimate to zero.
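The comparison can be made concrete on a toy example. The sketch below is not the paper's algorithm: it is a plain coordinate-descent solver for the criterion $\frac{1}{n}\sum_i (Y_i - \beta' X_i)^2 + 2r\sum_j |\beta_j| + c\sum_j \beta_j^2$, which is the form of the penalty suggested by the displays in Appendix A and is an assumption here. The design, response and helper names are hypothetical. The design is orthogonal, so the solution is available in closed form: each coefficient is soft-thresholded at $r$ and divided by $1 + c$.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def penalized_least_squares(X, y, r, c=0.0, sweeps=50):
    """Coordinate descent for (1/n)*RSS + 2r*|beta|_1 + c*|beta|_2^2.

    c = 0 gives the Lasso; c > 0 gives the elastic net (assumed form).
    """
    n, M = len(X), len(X[0])
    beta = [0.0] * M
    for _ in range(sweeps):
        for j in range(M):
            # Correlation of column j with the partial residual.
            z = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(M) if k != j))
                for i in range(n)
            ) / n
            v = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(z, r) / (v + c)
    return beta


def selected_set(beta):
    """Indices of the nonzero coefficients (the set estimate)."""
    return [j for j, b in enumerate(beta) if b != 0.0]


# Orthogonal toy design: columns are mutually orthogonal with (1/n) * sum x^2 = 1.
X = [[1, 1, 1], [1, -1, 1], [1, 1, -1], [1, -1, -1]]
# Noise-free response with coefficients (2, 0, 0.05): one strong, one very weak signal.
y = [2.05, 2.05, 1.95, 1.95]

r, B = 0.1, 2.0
beta_lasso = penalized_least_squares(X, y, r=r, c=0.0)
beta_enet = penalized_least_squares(X, y, r=r, c=r / (2 * B))
print(selected_set(beta_lasso), selected_set(beta_enet))
```

In this example both estimates select only the first variable: the coefficient of size 2 exceeds $2r$ and is kept, while the coefficient of size 0.05 lies below $r$ and is set to zero. The elastic net with $c = r/(2B)$ returns the same support with a slightly more shrunken coefficient.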

Detection of weak signals via $\ell_1$ and $\ell_1 + \ell_2$ penalized logistic regression.

The identifiability condition needed for linear models needs to be adjusted to 
the nature of the logistic regression model, in a manner similar to that of replac- 
ing Condition Stabil by Condition LStabil. We impose below a new condition: 
a weighted correlation matrix should exhibit the same type of separation we 
required of the correlation matrix of the data. The weights depend on the link 
function. This perhaps comes with little surprise: the correlation matrix appears 
explicitly in the expression of the least squares estimates in linear models, and 
this is not typically the case for other models and estimates. We formalize this 
below. For a given $0 < \delta < 1$, $M$ and $n$, let

$$r \ge 6L\sqrt{\frac{2\log 2(M \vee n)}{n}} + 2L\sqrt{\frac{2\log(1/\delta)}{n}} + \frac{1}{4(M \vee n)}, \qquad (3.4)$$

$$\epsilon = \frac{\log 2}{2^{(M \vee n)+1}} \times \frac{1}{r},$$

where we recall that $L$ is a common bound on the $X_{ij}$'s. Let $d$ be as required by
Condition Identif. Recall that for such $0 < d < 1$ there exists $0 < b < 1$ for
which Condition Stabil holds, as specified in Lemma 2.1. For this $b$ define

$$U = \left\{ a \in \mathbb{R}^n : \max_{1 \le i \le n} |a_i| \le \frac{4Lrk^*}{sb} + L\left(1 + \frac{1}{r}\right)\epsilon \right\},$$

for $s > 0$ given in Theorem 2.4. The definition of $U$ is justified by the properties
of the estimates $\hat\beta$ discussed in Section 2, which have been proved under Condition
Stabil and Assumption A. Let $g(z) = e^z/(1 + e^z)$.

Condition Lidentif. Let $d$ be the constant required by Condition Identif. We
assume that

$$\sup_{a \in U} \max_{k \in I^*,\, j \ne k} \left| \frac{1}{n} \sum_{i=1}^n g'(a_i) X_{ij} X_{ik} \right| \le \frac{d}{k^*}.$$
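To make the weighted correlations in this condition concrete, the sketch below evaluates $\max_{k \in I^*, j \ne k} |\frac{1}{n}\sum_i g'(a_i) X_{ij} X_{ik}|$ for one fixed vector $a$. It does not take the supremum over $U$, and the design matrix, the vector $a$ and the function names are hypothetical.

```python
import math


def logistic_derivative(z):
    """g'(z) for g(z) = e^z / (1 + e^z), written as g(z) * (1 - g(z))."""
    g = 1.0 / (1.0 + math.exp(-z))
    return g * (1.0 - g)


def max_weighted_correlation(X, a, true_set):
    """max over k in true_set, j != k of |(1/n) sum_i g'(a_i) X_ij X_ik|."""
    n, M = len(X), len(X[0])
    weights = [logistic_derivative(a_i) for a_i in a]
    worst = 0.0
    for k in true_set:
        for j in range(M):
            if j == k:
                continue
            value = abs(sum(weights[i] * X[i][j] * X[i][k] for i in range(n)) / n)
            worst = max(worst, value)
    return worst


# Orthogonal toy design: with constant weights the weighted correlations vanish.
X = [[1, 1, 1], [1, -1, 1], [1, 1, -1], [1, -1, -1]]
flat = max_weighted_correlation(X, a=[0.0, 0.0, 0.0, 0.0], true_set=[0])
# Unequal weights break the orthogonality of the weighted design.
tilted = max_weighted_correlation(X, a=[2.0, 0.0, 0.0, 0.0], true_set=[0])
print(flat, tilted)
```

The example illustrates why the condition is stated separately from Condition Identif: a design whose columns are uncorrelated can still have nonzero weighted correlations once the weights $g'(a_i)$ vary across observations.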



Remark 1. We give a formal justification of this condition in the course of the 
proof of Proposition 3.4 in Appendix A below. It is a natural condition that 
appears via a linearization of the likelihood function. The term containing $\epsilon$ in
the definition of $U$ is exponentially small, and can be essentially ignored for
practical purposes; its role is purely technical.
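A quick numerical check of how small this term is, assuming the reading $\epsilon = \frac{\log 2}{2^{(M \vee n)+1}} \times \frac{1}{r}$ of the definition above. The function name and the sample values are hypothetical; the computation is done on the $\log_{10}$ scale to avoid floating-point underflow.

```python
import math


def log10_epsilon(M, n, r):
    """log10 of epsilon = (log 2 / 2^{max(M, n) + 1}) * (1 / r)."""
    exponent = max(M, n) + 1
    return math.log10(math.log(2.0)) - exponent * math.log10(2.0) - math.log10(r)


# Even for a modest sample size the term is far below any practical precision.
print(log10_epsilon(M=50, n=100, r=0.5))
print(log10_epsilon(M=50, n=1000, r=0.2))
```

Since the exponent grows linearly in $M \vee n$ while $1/r$ grows only polynomially, the term $(1 + \frac{1}{r})\epsilon$ is negligible next to $r$ in all the bounds where it appears.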

Proposition 3.4. Let $r$ and $\epsilon$ be as in (3.4) above and $s > 0$ as in Theorem
2.4. Let Assumption A hold.

(1) Assume that Conditions Identif and Lidentif are met with $d \le \frac{s}{16 + 2s(7 + \epsilon)}$,
for a set $U$ corresponding to $b \le 1 - d(7 + \epsilon)$. If

$$\min_{j \in I^*} |\beta^*_j| \ge 3.5r + 3\left(1 + \frac{1}{r}\right)\epsilon,$$

and $\hat{I}$ corresponds to the $\ell_1$ penalized logistic regression estimate then

$$\mathbb{P}(I^* \subseteq \hat{I}) \ge 1 - 3\delta.$$

(2) Let $B > 0$ be such that $\max_{j \in I^*} |\beta^*_j| \le B$ and choose $c = \frac{r}{2B}$. Assume
that Conditions Identif and Lidentif are met with d < 17+ ^g + ^ , for a set U 
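corresponding to $b \le 1 - d(8 + \epsilon)$. If

$$\min_{k \in I^*} |\beta^*_k| \ge 3.5r + \left(1 + \frac{1}{r}\right)\epsilon,$$

and $\hat{I}$ corresponds to the $\ell_1 + \ell_2$ penalized logistic regression estimate then

$$\mathbb{P}(I^* \subseteq \hat{I}) \ge 1 - 3\delta.$$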

Remark 2. Notice that if $g(x) = x$ is the linear link, Condition Lidentif becomes
Condition Identif. Since $\epsilon$ is exponentially small, the requirement on the
minimum size of the coefficients is essentially

$$\min_{k \in I^*} |\beta^*_k| \ge 3.5r.$$
As discussed in the remark following Proposition 3.3 above, Corollary 3.2
shows that $\mathbb{P}(I^* \subseteq \hat{I})$ can also be controlled under the less restrictive Condition
Stabil, but in that case we can only recover sets $I^*$ corresponding to the large
signal strength $\min_{j \in I^*} |\beta^*_j| > \frac{4}{sb} r k^* + (1 + \frac{1}{r})\epsilon$. In contrast, Proposition 3.4 shows
that we can detect weaker signals; however, the correlation structure needs to
follow the more restrictive Conditions Lidentif and Identif. As discussed before,
similar properties are valid for the elastic net estimate, for an appropriate choice
of the tuning sequence $c$. Refinements of this result that replace the possibly
small constant $s$ by a term close to 1 are possible if, instead of statements that
hold with probability larger than $1 - \delta$, we consider statements that hold with
probability converging to one. For this, one can use Proposition 2.5 and Theorem
2.6. Since these results are very similar to those above, we do not include them
here, for brevity.



3.2. Correct subset selection 

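The set estimates $\hat{I}$ of the previous section satisfy $\mathbb{P}(I^* \subseteq \hat{I}) \ge 1 - \gamma_1$, for an
appropriate $\gamma_1$. In what follows we show that $\hat{I}$ also satisfies $\mathbb{P}(\hat{I} \subseteq I^*) \ge 1 - \gamma_2$,
thereby guaranteeing that $\mathbb{P}(\hat{I} = I^*) \ge 1 - \gamma$, for $\gamma_1 + \gamma_2 = \gamma$.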
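The step from the two one-sided inclusions to exact recovery is a union bound, which can be written out as follows.

```latex
\begin{align*}
\mathbb{P}(\hat{I} \neq I^*)
  &= \mathbb{P}\bigl(\{I^* \not\subseteq \hat{I}\} \cup \{\hat{I} \not\subseteq I^*\}\bigr) \\
  &\le \mathbb{P}(I^* \not\subseteq \hat{I}) + \mathbb{P}(\hat{I} \not\subseteq I^*) \\
  &\le \gamma_1 + \gamma_2 = \gamma ,
\end{align*}
so that $\mathbb{P}(\hat{I} = I^*) \ge 1 - \gamma$.
```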



3.2.1. Correct selection via the Lasso and the elastic net in linear regression 
models 



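Theorem 3.5. Let $K$ be an upper bound on $k^*$ and take

$$r \ge 2\sqrt{\frac{2\ln(2KM/\delta)}{n}}, \quad \text{if } Y \in \{0, 1\},$$

or

$$r \ge 4L\sigma\sqrt{\frac{\ln(4KM/\delta)}{n}} \vee \frac{8L}{n}\ln\frac{4KM}{\delta}, \quad \text{if } Y \in \mathbb{R}.$$

Assume that

$$\min_{j \in I^*} |\beta^*_j| \ge 2r.$$

(1) Assume that Condition Identif is met for $d \le \frac{1}{15}$. If $\hat{I}$ corresponds to the
$\ell_1$ penalized least squares estimator, then

$$\mathbb{P}(\hat{I} = I^*) \ge 1 - 3\delta - \frac{\delta}{M}.$$

(2) Assume, in addition, that $\max_{j \in I^*} |\beta^*_j| \le B$ for some $B > 0$ and choose
$c = \frac{r}{2B}$. If Condition Identif is met for $d \le \frac{1+c}{17.5}$ and $\hat{I}$ corresponds to the
$\ell_1 + \ell_2$ penalized least squares estimator, then

$$\mathbb{P}(\hat{I} = I^*) \ge 1 - 3\delta - \frac{\delta}{M}.$$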
Remark. Since $k^*$ is unknown, one can always take $K = M$. However, if in some
instances one has a rough idea of the order of magnitude of k*, one can use that 
value instead of the conservative bound M. The remarks on the relative merits 
of the Lasso versus the elastic net from the previous sections apply here with 
no change. 

Recall that the Lasso parameter estimates $\hat\beta$ may not be unique. However,
the set estimates $\hat{I}$ are unique, for each given tuning sequence $r$. This result,
which we prove in Appendix B, is needed throughout the paper to ensure that
the problem is well posed. We mention it again here, since it will be used
constructively in the proof of Theorem 3.5 in Appendix A.

Theorem 3.5 has immediate asymptotic implications. It guarantees that $I^*$
will be consistently estimated by $\hat{I}$ if $M$, the number of candidate variables, is
polynomial in $n$, i.e. $M = O(n^\zeta)$, for some $\zeta > 0$. To obtain this result it suffices
to replace $\delta$ by any sequence converging to zero with $n$. For instance, choosing
$\delta = 1/n$ and restating the value of $r$ in terms of order of magnitude we have the
following corollary.

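Corollary 3.6. Let $r = O\left(\sqrt{\frac{\log n}{n}}\right)$ and assume that $\min_{j \in I^*} |\beta^*_j| = O\left(\sqrt{\frac{\log n}{n}}\right)$.
Then, under the assumptions (1) or (2), respectively, of Theorem 3.5 we have

$$\lim_{n \to \infty} \mathbb{P}(\hat{I} = I^*) = 1,$$

for $\hat{I}$ either the $\ell_1$ or the $\ell_1 + \ell_2$ penalized least squares estimator.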
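The order of magnitude can be checked numerically. The sketch below assumes the binary-response tuning parameter $r = 2\sqrt{2\ln(2KM/\delta)/n}$ with the conservative choice $K = M$, sets $M = n^\zeta$ and $\delta = 1/n$, and compares $r$ with $\sqrt{\log n / n}$. The function names are hypothetical.

```python
import math


def tuning_parameter(n, zeta):
    """r for M = n**zeta, K = M and delta = 1/n (binary-response case)."""
    M = n ** zeta
    K = M
    delta = 1.0 / n
    return 2.0 * math.sqrt(2.0 * math.log(2.0 * K * M / delta) / n)


def normalized_rate(n, zeta):
    """Ratio r / sqrt(log n / n); it stays bounded as n grows."""
    return tuning_parameter(n, zeta) / math.sqrt(math.log(n) / n)


for n in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    print(n, tuning_parameter(n, 1.0), normalized_rate(n, 1.0))
```

Under these assumptions $\ln(2KM/\delta) = \ln 2 + (2\zeta + 1)\ln n$, so the ratio decreases toward $2\sqrt{2(2\zeta + 1)}$, which is consistent with a tuning parameter of order $\sqrt{\log n / n}$.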

3.2.2. Correct variable selection via $\ell_1$ or $\ell_1 + \ell_2$ penalized logistic regression

In this subsection we show that the type of results that hold for $\ell_1$ or $\ell_1 + \ell_2$
penalized least squares continue to hold for penalized logistic regression, under
requirements on the correlation matrix that are tailored to this type of loss
function.

Theorem 3.7. Under the assumptions of Proposition 3.4 we have:

(1) If $\hat{I}$ corresponds to the $\ell_1$ penalized logistic regression estimate then

$$\mathbb{P}(I^* = \hat{I}) \ge 1 - 5\delta.$$

(2) If $\hat{I}$ corresponds to the $\ell_1 + \ell_2$ penalized logistic regression estimate then

$$\mathbb{P}(I^* = \hat{I}) \ge 1 - 5\delta.$$

The asymptotic implications of Theorem 3.7 are again immediate. If $M$ is
polynomial in $n$, then for $\delta = 1/n$ we obtain:

Corollary 3.8. Let $r = O\left(\sqrt{\frac{\log n}{n}}\right)$ and assume that $\min_{j \in I^*} |\beta^*_j| = O\left(\sqrt{\frac{\log n}{n}}\right)$.
Then, under Assumption A and the assumptions on the design required for (1)
or (2), respectively, of Theorem 3.7 we have

$$\lim_{n \to \infty} \mathbb{P}(\hat{I} = I^*) = 1,$$

for $\hat{I}$ either the $\ell_1$ or the $\ell_1 + \ell_2$ penalized logistic regression estimator.

Remark. In the proofs of Corollary 3.8 and Theorem 3.7 above we invoked
Theorem 2.4 of Section 2.3.1 and therefore used its hypotheses. If instead one
invoked the asymptotic result of Theorem 2.6, one could obtain a version of
Corollary 3.8 with Assumption A replaced by the conditions $\max_{k \in I^*} |\beta^*_k| \le B$
and $r k^* \to 0$. For polynomially large $M$, the order of our tuning sequence is
$r = O\left(\sqrt{\frac{\log n}{n}}\right)$. The condition $r k^* \to 0$ therefore places the restriction $k^* \le C\sqrt{\frac{n}{\log n}}$
on the size of the true model, for some positive constant $C$. In this context,
similar results, based on different arguments, have been independently obtained
by [17], under the slightly more stringent requirements k* < an d 
minfc e /» |/3£ | > p-, but under slightly more relaxed conditions on the weighted 
matrix of the design. 



4. Conclusions 



The scope of this paper is to offer finite sample, non-asymptotic benchmarks
on the performance of the Lasso and the closely related elastic net methods
for variable selection in logistic and linear regression models. We showed that
the methods can be used for correct variable selection in identifiable models,
where we defined identifiability via Condition Identif and Condition Lidentif.
The added requirement for correct selection, versus good prediction, is on the
size of the signal strength: we can detect coefficients larger than a small constant
multiplied by the tuning parameter of the $\ell_1$ penalty. This tuning parameter is a
function of $n$, $M$ and the level of confidence, $\delta$. The size of the tuning parameter
has to be larger than the noise level, typically of order $n^{-1/2}$, up to factors that
are logarithmic in $M$ and $1/\delta$. Our contribution can be detailed as follows.

Lasso and the elastic net in linear regression. The properties of the $\ell_1$ penalized
least squares in regression models are becoming well understood, while those of
the $\ell_1 + \ell_2$ penalized least squares have not been investigated from this
perspective. We complemented the existing results on the Lasso estimates by providing
a refinement of assumptions. We showed in Section 2 that the $\ell_1$ penalized
estimates belong to sparse $\ell_1$ balls under Condition Stabil, also proposed in [2]. We
included a full proof of this result to facilitate the comparison with the elastic
net estimates, which allow for a slightly higher degree of correlation between
the $X$ variables than the one permitted by the Lasso estimate. We discussed in
Section 2 the precise interplay between this degree of correlation and the choice
of the tuning parameters. If the tuning parameter of the $\ell_2$ term is smaller than
the tuning parameter of the $\ell_1$ term, this estimator is also sparse: it belongs
to a sparse $\ell_1$ ball centered at the true value and can be used to recover the
true coefficient set $I^*$ with high probability. However, care must be taken when
using this estimate: if the tuning sequence accompanying the extra $\ell_2$ term is
too large we would essentially have a ridge regression estimate, and no variable
selection will be performed.
In Section 3 we provided a non-asymptotic analysis of the subset selection
problem in linear models, which complements the existing asymptotic results.
We showed that the signal detection boundaries suggested by previous
asymptotic analyses can be relaxed. In the works of [21] and [3], which investigate
aspects of selection consistency, the minimal signal strength is required to be
$n^{-\frac{1}{2} + \epsilon}$, for some $\epsilon > 0$, up to unspecified and possibly large constants. The work
in [3] requires Condition Identif from Section 2 above. In [21] a less restrictive
assumption on the design matrix is imposed, namely the irrepresentable design
condition, which is almost necessary and sufficient for the sign consistency of
the estimators, which implies consistent subset selection. The work of [14] uses
a coherence-type condition similar to our Condition Identif, which is shown to
be a sufficient condition for the sign consistency of a further thresholded Lasso
estimator. The price to pay is a stronger requirement on the minimum size of the
detectable coefficients: this size depends on sequences involved in the definition
of their coherence condition and $k^*$. These requirements are similar in spirit to
those discussed in our Corollary 3.2 above, and share similar drawbacks.

We showed here that if one concentrates directly on the study of $\mathbb{P}(\hat{I} = I^*)$,
instead of sign consistency, and studies the original (untruncated) Lasso
estimator under Condition Identif, one can relax the requirement on $\min_{j \in I^*} |\beta^*_j|$.
We showed in Theorem 3.5 that one only needs $\min_{j \in I^*} |\beta^*_j|$ to be larger than
$\sqrt{\ln(KM/\delta)/n}$, up to small constants independent of the design. For $M$ polynomial
in $n$ and the choice $\delta = 1/n$ one can therefore detect, with the untruncated
Lasso, coefficients of order $O\left(\sqrt{\frac{\log n}{n}}\right)$.

Lasso and the elastic net in logistic regression models. We showed in this article
that the $\ell_1$ and $\ell_1 + \ell_2$ penalized logistic regression estimators have features that
are similar to $\ell_1$ and $\ell_1 + \ell_2$ penalized least squares estimators, but the study
of the estimates depends on conditions on a weighted correlation matrix of the
data.

The predictive performance and adaptation to unknown sparsity of the Lasso
penalized estimates in generalized linear models received very little attention,
with the notable exceptions of [19, 20] and [11] in regression and classification.
Here we revisited some of these issues, and showed that the $\ell_1$ penalized logistic
regression estimators, as well as the elastic net estimates, belong to sparse $\ell_1$ balls
under the weaker Condition Stabil. The size of the radii of these balls can be
improved asymptotically under Condition LStabil. We also showed that the $\ell_1 +
\ell_2$ penalized logistic regression estimators, which have not yet been investigated,
exhibit the same adaptation to unknown sparsity as the Lasso estimates, for
appropriate choices of the tuning parameters given in Section 2.3. We showed
in Theorem 3.7 that, similar to linear models, $\ell_1$ or $\ell_1 + \ell_2$ penalized logistic
regression can be used to estimate $I^*$ with very high probability. The difference
is in the conditions on the correlation matrix, which need to be adapted to
the nature of this model, as in Condition Lidentif. The size of the coefficients
that are detectable via this method is also of the order $O\left(\sqrt{\frac{\log n}{n}}\right)$, where the
constants involved in this bound are independent of the design or sparsity level.



Appendix A 

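Proof of Theorem 2.2. Let $X_i$ be the $M$ dimensional vector with entries $X_{ij}$,
$1 \le j \le M$. For ease of notation, let $r_{n,M}(\delta) = r$. By the definition of the
estimator, and with $W_i = Y_i - E(Y_i | X_i)$, we obtain

$$(\hat\beta - \beta^*)' \left\{ \frac{1}{n} \sum_{i=1}^n X_i X_i' \right\} (\hat\beta - \beta^*)
\le \sum_{j=1}^M \left\{ \frac{2}{n} \sum_{i=1}^n W_i X_{ij} \right\} (\hat\beta_j - \beta^*_j)
+ 2r \sum_{j=1}^M |\beta^*_j| - 2r \sum_{j=1}^M |\hat\beta_j|. \qquad (4.1)$$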



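Define the event

$$A = \bigcap_{j=1}^M \left\{ \left| \frac{2}{n} \sum_{i=1}^n W_i X_{ij} \right| \le r \right\}. \qquad (4.2)$$

Notice that on the event $A$ display (4.1) yields, via simple algebra, that

$$\sum_{j \notin I^*} |\hat\beta_j - \beta^*_j| \le 3 \sum_{j \in I^*} |\hat\beta_j - \beta^*_j|.$$

Therefore, on the set $A$ we have $\hat\beta - \beta^* \in V$, with $V$ defined in (2.1), for $\epsilon = 0$
and $a = 3$.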

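Adding $r|\hat\beta - \beta^*|_1$ to both sides of (4.1) and re-arranging the terms we also
have

$$(\beta^* - \hat\beta)' \left\{ \frac{1}{n} \sum_{i=1}^n X_i X_i' \right\} (\beta^* - \hat\beta) + r|\hat\beta - \beta^*|_1 \le 4r \sum_{j \in I^*} |\hat\beta_j - \beta^*_j|. \qquad (4.3)$$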



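Using the Cauchy-Schwarz inequality in the right hand side of the inequality
above, followed by an inequality of the type $2uv \le au^2 + v^2/a$, for any $a > 1$,
we further obtain

$$(\beta^* - \hat\beta)' \left\{ \frac{1}{n} \sum_{i=1}^n X_i X_i' \right\} (\beta^* - \hat\beta) + r|\hat\beta - \beta^*|_1 \le 4ar^2k^* + \frac{1}{a} \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2.$$

Since $\hat\beta - \beta^* \in V$ we can invoke Condition Stabil and, by taking $a = 1/b$, we
obtain, on the set $A$, that

$$|\hat\beta - \beta^*|_1 \le \frac{4}{b} r k^*.$$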
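The last step can be spelled out as follows. This is a sketch which assumes that Condition Stabil, stated in Section 2, is applied here in the form $(\beta^* - \hat\beta)' \{ \frac{1}{n}\sum_i X_i X_i' \} (\beta^* - \hat\beta) \ge b \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2$, that is, with $\epsilon = 0$.

```latex
\begin{align*}
b \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2 + r|\hat\beta - \beta^*|_1
  &\le 4 a r^2 k^* + \frac{1}{a} \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2
  && \text{(Condition Stabil on the left)} \\
b \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2 + r|\hat\beta - \beta^*|_1
  &\le \frac{4 r^2 k^*}{b} + b \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2
  && \text{(taking } a = 1/b\text{)} \\
|\hat\beta - \beta^*|_1 &\le \frac{4}{b}\, r k^*
  && \text{(cancel the quadratic terms, divide by } r\text{)}.
\end{align*}
```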
To conclude the proof we determine now $r = r_{n,M}(\delta)$ such that $\mathbb{P}(A^c) \le \delta$. If
$Y \in \{0, 1\}$ we use Hoeffding's inequality to obtain

$$\mathbb{P}(A^c) \le \sum_{j=1}^M \mathbb{P}\left( \left| \frac{2}{n} \sum_{i=1}^n W_i X_{ij} \right| > r \right) \le 2M \exp(-nr^2/8), \qquad (4.4)$$

and the choice

$$r_{n,M}(\delta) \ge 2\sqrt{\frac{2\ln(2M/\delta)}{n}}$$

guarantees that $\mathbb{P}(A^c) \le \delta$. If $Y \in \mathbb{R}$ we use Bernstein's inequality to obtain

M 



w < 531 



- E ^ 



> r 



(4.5) 



< 2M exp 



nr nr 



and the choice

$$r_{n,M}(\delta) \ge 4L\sigma\sqrt{\frac{\ln(4M/\delta)}{n}} \vee \frac{8L}{n}\ln\frac{4M}{\delta}$$

guarantees that $\mathbb{P}(A^c) \le \delta$. This concludes the proof.



□ 
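The Hoeffding step of the proof above can be verified numerically. The sketch below assumes the tail bound $2M\exp(-nr^2/8)$ of display (4.4) and the choice $r = 2\sqrt{2\ln(2M/\delta)/n}$, and confirms that the bound then equals $\delta$. The function names and sample values are hypothetical.

```python
import math


def hoeffding_tail_bound(n, M, r):
    """Union bound 2 * M * exp(-n * r^2 / 8) on P(A^c), as in (4.4)."""
    return 2.0 * M * math.exp(-n * r * r / 8.0)


def hoeffding_tuning(n, M, delta):
    """Smallest r for which the bound above does not exceed delta."""
    return 2.0 * math.sqrt(2.0 * math.log(2.0 * M / delta) / n)


n, M, delta = 1000, 50, 0.05
r = hoeffding_tuning(n, M, delta)
# With this r, n * r^2 / 8 = ln(2M / delta), so the bound equals delta.
print(r, hoeffding_tail_bound(n, M, r))
```

Any larger value of $r$ makes the bound smaller than $\delta$, which is why the statement is phrased as an inequality on $r$.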



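Proof of Theorem 2.3. Using the definition of the estimator, the fact that
$\max_{j \in I^*} |\beta^*_j| \le B$ and our choice of $c = \frac{r}{2B}$ we obtain, on the event $A$, that

$$\sum_{j \notin I^*} |\hat\beta_j - \beta^*_j| \le 4 \sum_{j \in I^*} |\hat\beta_j - \beta^*_j|.$$

Therefore, on the set $A$ we have $\hat\beta - \beta^* \in V$, with $V$ defined in (2.1), for $\epsilon = 0$
and $a = 4$.

We use the same reasoning as in Theorem 2.2, and invoke Condition Stabil
to obtain the analogue of display (4.3). The only difference is that we complete
the square generated by the $\ell_2$ part of the penalty:

$$b \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2 + c \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2 + r|\hat\beta - \beta^*|_1
\le 2c \sum_{j \in I^*} \beta^*_j(\beta^*_j - \hat\beta_j) + 4r \sum_{j \in I^*} |\hat\beta_j - \beta^*_j|, \qquad (4.6)$$

and so, under the assumption that $\max_{j \in I^*} |\beta^*_j| \le B$ and our choice of $c = \frac{r}{2B}$
we obtain, for any $a > 1$

$$(b + c) \sum_{j \in I^*} (\beta^*_j - \hat\beta_j)^2 + r|\hat\beta - \beta^*|_1 \le 4.25\,a k^* r^2 + \frac{1}{a} \sum_{j \in I^*} (\beta^*_j - \hat\beta_j)^2, \qquad (4.7)$$

and the remaining part of the proof is identical to that of Theorem 2.2, if we
now choose $a = \frac{1}{b+c}$. □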
Proof of Theorem 2.4. Recall that we denoted the logistic loss function by

$$l(\beta) =: l(\beta; x, y) =: -y\beta'x + \log(1 + \exp \beta'x),$$

and the associated risk by $Pl(\beta) = E\,l(\beta; Y, X)$. We also denote the empirical
risk by

$$P_n l(\beta) = \frac{1}{n} \sum_{i=1}^n \left\{ -Y_i \sum_{j=1}^M \beta_j X_{ij} + \log\left(1 + \exp \sum_{j=1}^M \beta_j X_{ij}\right) \right\}.$$

With this notation and letting $r = r_{n,M}(\delta)$, the estimator satisfies, by definition

$$P_n l(\hat\beta) + 2r \sum_{j=1}^M |\hat\beta_j| \le P_n l(\beta^*) + 2r \sum_{j=1}^M |\beta^*_j|.$$

By adding and subtracting $P(l(\hat\beta) - l(\beta^*)) + r \sum_{j=1}^M |\hat\beta_j - \beta^*_j|$ to both sides and
rearranging terms we obtain

$$r|\hat\beta - \beta^*|_1 + P(l(\hat\beta) - l(\beta^*)) \le (P_n - P)(l(\beta^*) - l(\hat\beta)) + r|\hat\beta - \beta^*|_1
+ 2r \sum_{j=1}^M |\beta^*_j| - 2r \sum_{j=1}^M |\hat\beta_j|. \qquad (4.8)$$



Let

$$L_n = \sup_{\beta \in \mathbb{R}^M} \frac{(P_n - P)(l(\beta^*) - l(\beta))}{|\beta - \beta^*|_1 + \epsilon}.$$

Notice first that if we change the $i$th pair $(X_i, Y_i)$ while keeping the others fixed,
the value of $L_n$ changes by at most $\frac{4L}{n}$, where $L$ is a common bound on all $X_{ij}$.
To see why, recall that $P_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i, Y_i}$ is the empirical measure putting
mass $1/n$ at each observation $(X_i, Y_i)$. Let $P_n' = \frac{1}{n}\left(\sum_{j \ne i} \delta_{X_j, Y_j} + \delta_{X_i', Y_i'}\right)$
be the empirical measure corresponding to changing the pair $(X_i, Y_i)$ to $(X_i', Y_i')$.
Then

$$\left| \frac{(P_n - P)(l(\beta^*) - l(\beta))}{|\beta - \beta^*|_1 + \epsilon} - \frac{(P_n' - P)(l(\beta^*) - l(\beta))}{|\beta - \beta^*|_1 + \epsilon} \right|
= \frac{1}{n} \frac{|l(\beta^*; Y_i, X_i) - l(\beta; Y_i, X_i) - l(\beta^*; Y_i', X_i') + l(\beta; Y_i', X_i')|}{|\beta - \beta^*|_1 + \epsilon}
\le \frac{4L}{n} \frac{|\beta - \beta^*|_1}{|\beta - \beta^*|_1 + \epsilon} \le \frac{4L}{n}, \qquad (4.9)$$

where the inequality follows immediately by a first order Taylor expansion and
the assumption that all $X$ variables are bounded by $L$. Therefore we can apply
the bounded difference inequality (e.g. Theorem 2.2, page 8 in [7]) to obtain
that

$$\mathbb{P}(L_n - EL_n \ge u) \le \exp\left(-\frac{nu^2}{8L^2}\right).$$

Thus, if we take

$$u = 2L\sqrt{\frac{2\log(1/\delta)}{n}}$$

we have

$$\mathbb{P}(L_n - EL_n \ge u) \le \delta.$$

We will use Lemma 3 in [20] to obtain a bound on $EL_n$. We re-state it here for
ease of reference, adapting it to our notation.

Lemma 3, [20]. Let J n be an integer such that 2 J ™ > n. Then, if L n defined 
above corresponds to a Lipschitz loss and the components of X are bounded by 
L, with probability one, thenEL n < ^/gi£ g 2 (^ /v ") _|-(7 2 _Ja_^ ; where C\,C2 
are positive constants depending on the Lipschitz constant and L. 

Our loss is Lipschitz in $t = \beta'x$, with constant 2. Also, inspection of the chaining
argument used in the proof of the Lemma shows that we can take $J_n = (M \vee n)$
and $\epsilon = \frac{\log 2}{2^{(M \vee n)+1}} \times \frac{1}{r}$. Therefore, by making the constants precise we obtain

$$EL_n \le 6L\sqrt{\frac{2\log 2(M \vee n)}{n}} + \frac{1}{4(M \vee n)}.$$

Define the event

$$E = \{L_n \le r\}. \qquad (4.10)$$

From the previous displays we then conclude that if

$$r \ge 6L\sqrt{\frac{2\log 2(M \vee n)}{n}} + \frac{1}{4(M \vee n)} + 2L\sqrt{\frac{2\log(1/\delta)}{n}}$$

then $\mathbb{P}(E) \ge 1 - \delta$.

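Since $P(l(\hat\beta) - l(\beta^*)) \ge 0$, by the definition of $\beta^*$, display (4.8) yields

$$\begin{aligned}
r \sum_{j=1}^M |\hat\beta_j - \beta^*_j|
&\le 2r \sum_{j=1}^M |\hat\beta_j - \beta^*_j| + 2r \sum_{j=1}^M |\beta^*_j| - 2r \sum_{j=1}^M |\hat\beta_j| + r\epsilon \\
&\le 2r \sum_{j=1}^M |\hat\beta_j| + 2r \sum_{j=1}^M |\beta^*_j| + 2r \sum_{j=1}^M |\beta^*_j| - 2r \sum_{j=1}^M |\hat\beta_j| + r\epsilon \\
&\le 4r|\beta^*|_1 + r\epsilon,
\end{aligned}$$

on the set $E$. Therefore, if Assumption A holds, we obtain

$$\sum_{j=1}^M |\hat\beta_j - \beta^*_j| \le 4D + \epsilon \le 5D, \qquad (4.11)$$

where we used the possibly conservative bound $D$ on $\epsilon$ to keep the exposition
clear.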

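By Example 4.5 in [18] we have $Pl(\hat\beta) - Pl(\beta^*) \ge \|g_{\hat\beta} - g_{\beta^*}\|^2$, where
$g_\beta(x) = \frac{\exp(\beta'x)}{1 + \exp(\beta'x)}$ and $\|\cdot\|$ is the $L_2$ norm with respect to the distribution of $X$. A first
order Taylor expansion gives $g_{\hat\beta}(x) - g_{\beta^*}(x) = \frac{\exp(\tilde\beta'x)}{(1 + \exp(\tilde\beta'x))^2} (f_{\hat\beta}(x) - f_{\beta^*}(x))$,
where $f_\beta(x) = \beta'x$ and $\tilde\beta'x$ is an intermediate point between $\hat\beta'x$ and $\beta^{*\prime}x$. Let
$\Delta = 6LD$, and let $s = (1 + e^\Delta)^{-4}$. Then, since Assumption A and (4.11) hold,
we have $\|g_{\hat\beta} - g_{\beta^*}\|^2 \ge s\|f_{\hat\beta} - f_{\beta^*}\|^2$.

Thus, on the event $E$, display (4.8) further yields

$$r \sum_{j=1}^M |\hat\beta_j - \beta^*_j| + s\|f_{\hat\beta} - f_{\beta^*}\|^2
\le \sum_{j=1}^M 2r|\hat\beta_j - \beta^*_j| + 2r \sum_{j=1}^M |\beta^*_j| - 2r \sum_{j=1}^M |\hat\beta_j| + r\epsilon. \qquad (4.12)$$

Via simple algebra, display (4.12) yields

$$\sum_{j \notin I^*} |\hat\beta_j - \beta^*_j| \le 3 \sum_{j \in I^*} |\hat\beta_j - \beta^*_j| + \epsilon,$$

on the set $E$. Therefore $\hat\beta - \beta^* \in V$, for the set $V$ given by (2.1) of Section 2,
with $a = 3$ and $\epsilon$ as in the statement of the theorem.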

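Let $\gamma_{kj} = EX_kX_j$, for $k, j \in \{1, \ldots, M\}$ and let $\Gamma$ be the $M \times M$ matrix
with entries $\gamma_{kj}$. Notice that $\|f_{\hat\beta} - f_{\beta^*}\|^2 = (\beta^* - \hat\beta)'\Gamma(\beta^* - \hat\beta)$ and so, using a
reasoning identical to the one used in display (4.3) of Theorem 2.2, we further
obtain

$$r \sum_{j=1}^M |\hat\beta_j - \beta^*_j| + s(\beta^* - \hat\beta)'\Gamma(\beta^* - \hat\beta) \le 4ar^2k^* + \frac{1}{a} \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2 + r\epsilon.$$

Since on the set $E$ we have $\hat\beta - \beta^* \in V$, we can use Condition Stabil. The
condition implies that $s(\beta^* - \hat\beta)'\Gamma(\beta^* - \hat\beta) \ge sb \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2 - \epsilon$ and so

$$r \sum_{j=1}^M |\hat\beta_j - \beta^*_j| + sb \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2 \le 4ar^2k^* + \frac{1}{a} \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2 + (r + 1)\epsilon. \qquad (4.13)$$

Taking $a = 1/(sb)$ we obtain, on the set $E$, that

$$\sum_{j=1}^M |\hat\beta_j - \beta^*_j| \le \frac{4}{sb} r k^* + \left(1 + \frac{1}{r}\right)\epsilon.$$

Since we have shown above that $\mathbb{P}(E) \ge 1 - \delta$, the proof is complete. □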
Proof of Proposition 2.5. First notice that on the event $E$ defined in the
previous theorem, display (4.8) yields

$$P(l(\hat\beta) - l(\beta^*)) \le 4r|\beta^*|_1 + r\epsilon.$$

Then, the assumptions of this proposition imply that the right hand side of the
above display converges to zero with $n$ and so we have that for any $\eta > 0$

$$\mathbb{P}\left(|P(l(\hat\beta) - l(\beta^*))| \ge \eta\right) \to 0, \qquad (4.14)$$

since $\mathbb{P}(E^c) \le \delta_n \to 0$.

Observe that $Pl(\beta) = P_X P_{Y|X}\, l(\beta)$, where we regard the expectations as being
taken with respect to a pair $(X, Y)$ independent of the sample. By the definitions
of the loss $l$ and $p(x)$ we have

$$P_{Y|X=x}\, l(\beta) = \log(1 + \exp(\beta'x)) - p(x)\beta'x.$$

Let $\theta > 0$ be arbitrary, fixed. Simple algebra shows that if $\sup_x |\beta'x - \beta^{*\prime}x| \ge \theta$
then $P_{Y|X=x}\, l(\beta) > P_{Y|X=x}\, l(\beta^*)$, for all $x$, and so there exists $\eta_\theta > 0$ such that
$P(l(\beta) - l(\beta^*)) \ge \eta_\theta$. Then

$$\mathbb{P}\left(\sup_x |\hat\beta'x - \beta^{*\prime}x| \ge \theta\right) \le \mathbb{P}\left(P(l(\hat\beta) - l(\beta^*)) \ge \eta_\theta\right) \to 0,$$

where the convergence to zero follows by (4.14) above. This concludes the proof
of this proposition. □

Proof of Theorem 2.6. The proof differs from the proof of Theorem 2.4 above
only in the way we obtain the lower bound on $P(l(\hat\beta) - l(\beta^*))$. For quantities
defined in the discussion immediately following display (4.11) above we write

$$g_{\hat\beta}(x) - g_{\beta^*}(x) = \frac{\exp(\tilde\beta'x)}{(1 + \exp(\tilde\beta'x))^2} (f_{\hat\beta}(x) - f_{\beta^*}(x)),$$

where we recall that $\tilde\beta'x$ is an intermediate point between $\hat\beta'x$ and $\beta^{*\prime}x$ and
that we defined $p(x) = \frac{\exp(\beta^{*\prime}x)}{1 + \exp(\beta^{*\prime}x)}$. Let $\theta > 0$ be arbitrarily close to zero. Let
$A_\theta$ be the set for which

$$\sup_x |\hat\beta'x - \beta^{*\prime}x| \le \theta,$$

and recall that, by Proposition 2.5 we have $\mathbb{P}(A_\theta) \to 1$. On the set $A_\theta$ we have

$$\exp(\tilde\beta'x - \beta^{*\prime}x) \left( \frac{1 + \exp(\beta^{*\prime}x)}{1 + \exp(\tilde\beta'x)} \right)^2 \ge e^{-3\theta} =: w,$$

for all $x$ and with $w$ arbitrarily close to 1. Let $\Gamma_1$ be the matrix with entries
$Ep(X)(1 - p(X))X_kX_j$, for $k, j \in \{1, \ldots, M\}$. Therefore, on $A_\theta$ we have
$\|g_{\hat\beta} - g_{\beta^*}\|^2 \ge w(\beta^* - \hat\beta)'\Gamma_1(\beta^* - \hat\beta)$. Invoking Condition LStabil we obtain
$w(\beta^* - \hat\beta)'\Gamma_1(\beta^* - \hat\beta) \ge wb \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2$ and the rest of the proof carries on
unchanged, with results holding now on the set $A_\theta \cap E$. □

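Proof of Theorem 2.7. The proof is identical to the one of Theorem 2.4 above,
up to the following display

$$r \sum_{j=1}^M |\hat\beta_j - \beta^*_j| + sb \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2 + c \sum_{j=1}^M \hat\beta_j^2 - c \sum_{j=1}^M \beta_j^{*2}
\le 4r \sum_{j \in I^*} |\beta^*_j - \hat\beta_j| + (r + 1)\epsilon.$$

To arrive at this display we observe that the elastic net satisfies

$$\sum_{j \notin I^*} |\hat\beta_j - \beta^*_j| \le 4 \sum_{j \in I^*} |\hat\beta_j - \beta^*_j| + \epsilon,$$

and so $\hat\beta - \beta^* \in V$, for the set $V$ given by (2.1) of Section 2, with $a = 4$ and $\epsilon$
as in the statement of the theorem. Therefore the use of Condition Stabil in the
derivations above is valid.

For the remainder of the proof we reason as in Theorem 2.3 above. We
complete the square in the left hand side of the inequality above and invoke the
assumption $\max_{j \in I^*} |\beta^*_j| \le B$ to obtain

$$r \sum_{j=1}^M |\hat\beta_j - \beta^*_j| + sb \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2 + c \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2
\le 4r \sum_{j \in I^*} |\beta^*_j - \hat\beta_j| + (r + 1)\epsilon + 2cB \sum_{j \in I^*} |\beta^*_j - \hat\beta_j|,$$

which immediately implies, by choosing $c$ such that $2cB = r$, that

$$r \sum_{j=1}^M |\hat\beta_j - \beta^*_j| + (sb + c) \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2 \le 2 \times 2.5r \sum_{j \in I^*} |\beta^*_j - \hat\beta_j| + (r + 1)\epsilon.$$

Then, we use again the Cauchy-Schwarz inequality followed by $2xy \le ax^2 + y^2/a$
to obtain

$$r \sum_{j=1}^M |\hat\beta_j - \beta^*_j| + (sb + c) \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2
\le (2.5)^2 a k^* r^2 + \frac{1}{a} \sum_{j \in I^*} (\hat\beta_j - \beta^*_j)^2 + (r + 1)\epsilon.$$

Choosing now $a = \frac{1}{sb + c}$ gives, on the event $E$ defined in (4.10) above

$$\sum_{j=1}^M |\hat\beta_j - \beta^*_j| \le \frac{6.25}{sb + c} r k^* + \left(1 + \frac{1}{r}\right)\epsilon.$$

Since we showed in the proof of Theorem 2.4 that $\mathbb{P}(E^c) \le \delta$, this completes
the proof. □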

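Proof of Proposition 3.3. Recall that we denoted the cardinality of $I^*$ by $k^*$.
First observe that by the definitions of $\hat{I}$ and $I^*$ and by the union bound we
have

$$\begin{aligned}
\mathbb{P}(I^* \not\subseteq \hat{I}) &\le \mathbb{P}(k \notin \hat{I} \text{ for some } k \in I^*) \\
&\le \mathbb{P}(\hat\beta_k = 0 \text{ and } \beta^*_k \ne 0, \text{ for some } k \in I^*) \\
&\le k^* \max_{k \in I^*} \mathbb{P}(\hat\beta_k = 0 \text{ and } \beta^*_k \ne 0).
\end{aligned}$$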

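We first show that $\mathbb{P}(I^* \subseteq \hat{I}) \ge 1 - \delta - \frac{\delta}{M}$ for the $\ell_1$ penalized least squares
estimator. It follows immediately from Lemma 4.1 in Appendix B below that if
$\hat\beta_k = 0$ is a component of the solution $\hat\beta$ then

$$\left| \frac{2}{n} \sum_{i=1}^n \Big( Y_i - \sum_{j=1}^M \hat\beta_j X_{ij} \Big) X_{ik} \right| \le 2r.$$

Therefore

$$\begin{aligned}
\mathbb{P}(I^* \not\subseteq \hat{I})
&\le k^* \max_{k \in I^*} \mathbb{P}(\hat\beta_k = 0 \text{ and } \beta^*_k \ne 0) \\
&\le k^* \max_{k \in I^*} \mathbb{P}\left( \left| \frac{2}{n} \sum_{i=1}^n \Big( Y_i - \sum_{j=1}^M \hat\beta_j X_{ij} \Big) X_{ik} \right| \le 2r;\ \beta^*_k \ne 0 \right) \\
&= k^* \max_{k \in I^*} \mathbb{P}\left( \left| 2\beta^*_k + \frac{2}{n} \sum_{i=1}^n W_i X_{ik} - 2 \sum_{j \ne k} (\hat\beta_j - \beta^*_j) \frac{1}{n} \sum_{i=1}^n X_{ij} X_{ik} \right| \le 2r \right) \\
&\le k^* \max_{k \in I^*} \mathbb{P}\left( \left| \beta^*_k + \frac{1}{n} \sum_{i=1}^n W_i X_{ik} - \sum_{j \ne k} (\hat\beta_j - \beta^*_j) \Big( \frac{1}{n} \sum_{i=1}^n X_{ij} X_{ik} \Big) \right| \le r \right) \\
&\le k^* \max_{k \in I^*} \mathbb{P}\left( \left| \frac{1}{n} \sum_{i=1}^n W_i X_{ik} \right| + \left| \sum_{j \ne k} (\hat\beta_j - \beta^*_j) \frac{1}{n} \sum_{i=1}^n X_{ij} X_{ik} \right| \ge |\beta^*_k| - r \right),
\end{aligned}$$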


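where the penultimate inequality follows by the triangle inequality $|a + b + c| \ge
|c| - |a| - |b|$. Under Condition Identif and since $\min_{j \in I^*} |\beta^*_j| \ge 2r$ we further
obtain

$$\mathbb{P}(I^* \not\subseteq \hat{I}) \le k^* \max_{k \in I^*} \mathbb{P}\left( \left| \frac{1}{n} \sum_{i=1}^n W_i X_{ik} \right| \ge \frac{r}{2} \right) + k^*\, \mathbb{P}\left( |\hat\beta - \beta^*|_1 \ge \frac{rk^*}{2d} \right).$$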


We argue exactly as in the course of the proof of Theorem 2.2 to bound the
probabilities above. We use either Hoeffding's inequality, for $Y \in \{0, 1\}$, or
Bernstein's inequality, for $Y \in \mathbb{R}$, to bound the first term by $\frac{\delta}{M}$, for $r$ given
by (3.3). Similarly, for this choice of $r$, we have

$$k^*\, \mathbb{P}\left( |\hat\beta - \beta^*|_1 \ge \frac{4rk^*}{b} \right) \le k^* \times \frac{\delta}{K} \le \delta,$$

for a constant $b$ for which Condition Stabil holds. By Lemma 2.1 in Section 2,
Condition Identif implies Condition Stabil with $b = 1 - 7d$. Notice that for this
value of $b$ we have $1/(2d) \ge 4/b$ for $d \le 1/15$, as required in the statement of this
proposition. Therefore, combining these results we obtain

$$\mathbb{P}(I^* \not\subseteq \hat{I}) \le \frac{\delta}{M} + \delta.$$
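The threshold on $d$ follows from a one-line computation, using only $b = 1 - 7d$ and $d > 0$ as stated in the proof.

```latex
\begin{align*}
\frac{1}{2d} \ge \frac{4}{b}
  \iff b \ge 8d
  \iff 1 - 7d \ge 8d
  \iff d \le \frac{1}{15} .
\end{align*}
```

Under this restriction the event $\{|\hat\beta - \beta^*|_1 \ge rk^*/(2d)\}$ is contained in $\{|\hat\beta - \beta^*|_1 \ge 4rk^*/b\}$, which is the event controlled by Theorem 2.2.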

We establish now similar results for the $\ell_1 + \ell_2$ penalized least squares
estimator. By the characterization of a zero component of the solution, given in
Lemma 4.3 in Appendix B below, we also have

$$\left| \frac{2}{n} \sum_{i=1}^n \Big( Y_i - \sum_{j=1}^M \hat\beta_j X_{ij} \Big) X_{ik} \right| \le 2r, \quad \beta^*_k \ne 0,$$

and so the proof is identical to the one above. The only modification is in
terms of constants: in this case Condition Identif implies Condition Stabil with
$b = 1 - 9d$. From Theorem 2.3 we obtain for the choice of $r$ given by (3.3) that

$$\mathbb{P}\left( |\hat\beta - \beta^*|_1 \ge \frac{4.25\,rk^*}{b + c} \right) \le \frac{\delta}{K}.$$

As above, we note $1/(2d) \ge 4.25/(b + c)$ for $d \le \frac{1+c}{17.5}$. Invoking now Theorem 2.3
with these constants concludes the proof. □

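Proof of Proposition 3.4. As in the previous proof, recall that we denoted the
cardinality of $I^*$ by $k^*$ and that

$$\mathbb{P}(I^* \not\subseteq \hat{I}) \le \mathbb{P}(k \notin \hat{I} \text{ for some } k \in I^*) \le k^* \max_{k \in I^*} \mathbb{P}(\hat\beta_k = 0 \text{ and } \beta^*_k \ne 0).$$

We begin by establishing the result for the $\ell_1$ penalized estimator. By Lemma
4.1 in the Appendix below it follows that if $\hat\beta_k = 0$ is a component of the solution
$\hat\beta$ then

$$\left| \frac{1}{n} \sum_{i=1}^n X_{ik} \frac{\exp \sum_{j=1}^M \hat\beta_j X_{ij}}{1 + \exp \sum_{j=1}^M \hat\beta_j X_{ij}} - \frac{1}{n} \sum_{i=1}^n Y_i X_{ik} \right| \le 2r.$$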
Let now



M 



Sn 



3 = 1 l-'j l^j 



ex P E,= i 4' X V 1 + ex P E j L i Z 3 ] 



Then, since $Y_i = Y_i - p(X_i) + p(X_i) =: W_i + p(X_i)$, where $p(X_i)$ is given by
(2.4), we obtain:



"(I* %I) < k* max 



Define 



/ n 

hP S n - - V 

M ~ A n \ 



<2r; ftjtO 



Recalling that ± E"=i ^ffe = 1 

we obtain, for every k <E I* , that 



<P LBJ-IS^-BJ 



( S„ - B„ + B n - - WiX, 

V 71 i=l 

1 - 

-J2w t x, 

z— 1 

/ 1 n 

i#i- E^-^bE^ x 

\ j¥* «=1 

/ n 



/9* ^Oj 
<2r; /3*^0 



1 5n -Bra I 



n \ 
-VW 4 X lfe <2r 



M 



i ™ 



>2 + (1 + ^ 



^„-B„| > £ + (l + -)e 
2 r 

where the last inequality follows from the assumption that $\min_{j \in I^*} |\beta^*_j| \ge 3.5r +
3(1 + \frac{1}{r})\epsilon$. We bound the first term above using Hoeffding's inequality:



( 1 ™ 

-E^*« 

\ i=i 



(4.15) 



since, in particular, r > 2L 

If Condition Identifholds, we can bound the second term of the last inequality 
of the display above by 



rk* 



1 



l/3-01i> 25- + (! + ->]< 



(4.16) 
as in Theorem 2.4, if $\frac{1}{2d} \ge \frac{4}{sb}$, with $b$ given by Condition Stabil. By Lemma
2.1, Condition Identif implies Condition Stabil with $b = 1 - d(7 + \epsilon)$, and so the
restriction on $d$ is $d \le \frac{s}{8 + s(7 + \epsilon)}$.

It remains to bound the term $\mathbb{P}\left(|S_n - B_n| \ge \frac{r}{2} + (1 + \frac{1}{r})\epsilon\right)$. For this, let $g(z) =
e^z/(1 + e^z)$ and notice that Taylor's formula gives $g(u) - g(v) = g'(a)(u - v)$,
for a point $a$ between $u$ and $v$, where $0 < g'(a) < 1$. Therefore



,1/ 



|fl- n -i? n | 



i " 

- 2(1 - sfifli^XaXik 

71 * 



(4.17) 



where $a_i$ is a point between $\sum_{j=1}^M \hat\beta_j X_{ij}$ and $\sum_{j=1}^M \beta_j^* X_{ij}$, for each $i$, and so



a/ 



M 



< L^lft" for each 



Let 



M 



Therefore, by Theorem 2.4, for $b$ chosen as in the discussion following display
(4.16) above, we have

\[ P(G_n^c) \le \delta. \]

Notice that on the event $G_n$ we have



M 

3=1 



^ ArLk* r 1, 

< ; h L(l H — )e, for each i. 

so r 



This justifies the definition of the set U in Condition Lidentif. 
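
The mean value step behind (4.17) can be checked numerically. The following is a minimal sketch, not part of the original argument: the sampling range and the sample size are arbitrary assumptions. It confirms that every difference quotient of the logistic function $g$ lies in $(0, 1)$, as used in the proof, and in fact in $(0, 1/4]$, since $g'(z) = g(z)(1 - g(z))$.

```python
import numpy as np

def g(z):
    """Logistic function g(z) = e^z / (1 + e^z), in a numerically stable form."""
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def g_prime(z):
    """Derivative g'(z) = g(z) * (1 - g(z))."""
    p = g(z)
    return p * (1.0 - p)

# Hypothetical test points; any pairs u != v would do.
rng = np.random.default_rng(0)
u = rng.uniform(-5.0, 5.0, size=1000)
v = rng.uniform(-5.0, 5.0, size=1000)

# By the mean value theorem each slope equals g'(a) for some a between u and v.
slopes = (g(u) - g(v)) / (u - v)

print("smallest slope:", slopes.min())
print("largest slope:", slopes.max())
```

The bound $g' \le 1/4$ is sharper than the bound $g' < 1$ quoted in the proof; the argument above only needs the weaker one.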
Combining the results above with (4.17) we obtain 

S n -B n \ > I + (l + i)de) 
2 r 



5. 



1 " 

- V(l-.g'(a 4 ))^^. 



2 r 



Note that if Condition Identif and Lidentif both hold for $d/2$ then

d 



1 

- _ 9 '( a i)) X i3 X ik 



< 



Thus, if $d \le \frac{s}{16 + 2s(7 + \epsilon)}$ and with $b$ chosen as in the discussion following display
(4.16) above, we have



\S n - B„ \ >- + (! + -)de) < ¥(G c n n G n ) + 



M M 
Therefore, collecting the bounds above, we obtain

\[ P(I^* \not\subseteq \hat I) \le 3\delta. \]

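The result for the $\ell_1 + \ell_2$ penalized estimator follows in an identical manner.
By Lemma 4.3 in Appendix B below, if $\hat\beta_k = 0$ is a component of the solution
$\hat\beta$ then

\[ \left| \frac{1}{n}\sum_{i=1}^n X_{ik} \frac{\exp\big(\sum_{j=1}^M \hat\beta_j X_{ij}\big)}{1 + \exp\big(\sum_{j=1}^M \hat\beta_j X_{ij}\big)} - \frac{1}{n}\sum_{i=1}^n Y_i X_{ik} \right| \le 2r. \]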



Therefore the remainder of the proof is identical to the proof above, if we invoke 
Theorem 2.7 instead of Theorem 2.4. □ 
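
The optimality condition that drives this proof can be verified on simulated data. The sketch below is an illustration under stated assumptions, not the author's procedure: the solver is a plain proximal gradient iteration, and the design, the true coefficients and the value `r = 0.05` are hypothetical. It fits the $\ell_1$ penalized logistic criterion with penalty $2r\sum_j|\beta_j|$ and checks that every zero coordinate of the fit satisfies the displayed bound $\le 2r$.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable logistic function."""
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def soft_threshold(z, t):
    """Proximal map of t * |.|, applied coordinate-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def logistic_lasso_ista(X, y, lam, n_iter=5000):
    """Proximal gradient for (1/n) sum_i {-y_i x_i'b + log(1 + exp(x_i'b))} + lam * ||b||_1."""
    n, M = X.shape
    # The logistic loss gradient is Lipschitz with constant <= lambda_max(X'X/n) / 4.
    step = 4.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    beta = np.zeros(M)
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ beta) - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Hypothetical simulated data with standardized columns, (1/n) sum_i X_ij^2 = 1.
rng = np.random.default_rng(3)
n, M = 200, 8
X = rng.standard_normal((n, M))
X /= np.sqrt((X ** 2).mean(axis=0))
beta_star = np.zeros(M)
beta_star[:2] = [2.0, -2.0]
y = (rng.uniform(size=n) < sigmoid(X @ beta_star)).astype(float)

r = 0.05                                   # hypothetical tuning value
beta_hat = logistic_lasso_ista(X, y, 2 * r)

# score_k = (1/n) sum_i X_ik * p_hat_i - (1/n) sum_i Y_i X_ik
score = X.T @ (sigmoid(X @ beta_hat) - y) / n
zero = beta_hat == 0
print("max |score| over zero coordinates:", np.abs(score[zero]).max() if zero.any() else None)
```

On the non-zero coordinates the same quantity equals $2r$ in absolute value, which is the other half of the condition in Lemma 4.1.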

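Proof of Theorem 3.5. In light of Proposition 3.3, it is enough to show that
$P(\hat I \subseteq I^*) \ge 1 - 2\delta$, for both the $\ell_1$ and $\ell_1 + \ell_2$ penalized least squares estimators.
We begin by showing that $P(\hat I \subseteq I^*) \ge 1 - 2\delta$ for the $\ell_1$ penalized estimate. Let

\[ h(\mu) = \frac{1}{n}\sum_{i=1}^n \Big(Y_i - \sum_{j \in I^*} \mu_j X_{ij}\Big)^2 + 2r \sum_{j \in I^*} |\mu_j|, \]

and define

\[ \hat\mu = \arg\min_{\mu} h(\mu). \tag{4.18} \]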



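Let

\[ B = \bigcap_{k \notin I^*} \left\{ \left| \frac{2}{n}\sum_{i=1}^n \Big(Y_i - \sum_{j \in I^*} \hat\mu_j X_{ij}\Big) X_{ik} \right| \le 2r \right\}. \]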



Let, by abuse of notation, $\hat\mu \in \mathbb{R}^M$ be the vector that has the components
of $\hat\mu$ in positions corresponding to the index set $I^*$ and components equal to
zero otherwise. By standard results in convex analysis, e.g. Lemma 4.1 in
Appendix B below, it follows that, on the set $B$, $\hat\mu$ is a solution of (2.2). Recall
that $\hat\beta$ is a solution of (2.2) by construction. By definition $\hat\beta_k \ne 0$ for $k \in \hat I$. By
construction, $\hat\mu_k \ne 0$ for $k \in S \subseteq I^*$, for some subset $S$. By Proposition 4.2 in
Appendix B, any two solutions have non-zero elements in the same positions,
therefore $\hat I = S \subseteq I^*$ on $B$. Hence




=1 



£ - £ MjXi 



r 

- 2 



A 



ik 



> 2r 



£(^-/3;)(i£^Aj 



r 

~ 2 



< 



E P ( n 

k=l \ 

-E 1 
r

~ 2 



k0> 
M 



k=l \ 



jei* \ i=i / 

n 

J2 W i X 



r 

~ 2 



r 

~ 2 



Hi* \jei* 



where we used Condition Identif to obtain the last inequality. Recall now that
if $Y \in \{0, 1\}$ the choice



r > 21 



'21n(2M) 



guarantees, as in display (4.4) of the proof of Theorem 2.2, that 



E P ( n 

k=l \ 



£ W i X ik 



>-2 ^ 



Repeating now the proof of Theorem 2.2, with $\hat\beta$ replaced by $\hat\mu$ and using only
the variables corresponding to $I^*$, we obtain

\[ |\hat\mu - \beta^*|_1 \le \frac{r k^*}{2d} \]

on the set



£ W * Xi - 



i=l 



< r 



By Hoeffding's inequality



I) < 2k* exp(-w 2 /8), 



and the choice 



r n , k *(S) > 2] 



/ 2 ln(^M) 



(4.19) 



implies that $P(A_1^c) \le \delta/M$, which in turn implies that



E p Eifc-si^h* 

HI* 



Here we used again the fact that, by Lemma 2.1, Condition Identif implies 
Condition Stabil and then reasoned as in Proposition 3.3 to conclude that the 
analogue of Theorem 2.2 can be used, for d < j^. 
The same conclusion holds if $Y \in \mathbb{R}$, by invoking Bernstein's inequality as
in (4.5) and the corresponding value of $r$ from the statement of this proposition,
instead of Hoeffding's inequality.

Of course, the choice in (4.19) is not implementable, as $k^*$ is not known
in practice, but we can always replace it by a known upper bound, or by the
conservative bound $M$. This completes the proof for this part of the proposition.
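
The construction used in this part of the proof (solve the problem restricted to $I^*$, pad with zeros, and check the optimality conditions on the remaining coordinates) can be sketched as follows. This is an illustration only: the coordinate descent solver, the simulated design, the true coefficients and the value `r = 0.25` are assumptions made for the example, and the check of the event $B$ uses the condition $|\partial L/\partial\beta_k| \le 2r$ from Lemma 4.1 with $L(\beta) = \frac{1}{n}\sum_i (Y_i - \beta'X_i)^2$.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=300):
    """Coordinate descent for (1/n) * ||y - X b||^2 + lam * ||b||_1."""
    n, M = X.shape
    beta = np.zeros(M)
    resid = y.astype(float).copy()          # resid = y - X @ beta
    col_sq = (X ** 2).mean(axis=0)
    for _ in range(n_sweeps):
        for j in range(M):
            resid += X[:, j] * beta[j]      # remove coordinate j from the fit
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta

# Hypothetical simulated data with standardized columns.
rng = np.random.default_rng(2)
n, M = 300, 12
X = rng.standard_normal((n, M))
X /= np.sqrt((X ** 2).mean(axis=0))
I_star = np.array([0, 1, 2])
beta_star = np.zeros(M)
beta_star[I_star] = [2.0, -1.5, 1.0]
y = X @ beta_star + 0.5 * rng.standard_normal(n)

r = 0.25                                    # hypothetical tuning value, penalty is 2r
mu_hat = lasso_cd(X[:, I_star], y, 2 * r)   # restricted problem, as in (4.18)
mu_full = np.zeros(M)
mu_full[I_star] = mu_hat                    # pad with zeros outside I*

# Event B: the gradient of the loss at the padded vector is at most 2r off I*.
grad = -2.0 * X.T @ (y - X @ mu_full) / n
off = np.setdiff1d(np.arange(M), I_star)
on_event_B = bool(np.all(np.abs(grad[off]) <= 2 * r))

beta_hat = lasso_cd(X, y, 2 * r)            # unrestricted problem, as in (2.2)
support = np.flatnonzero(beta_hat)
print("on event B:", on_event_B, "| estimated support:", support)
```

In this example the design has full column rank, so the unrestricted solution is unique and coincides with the padded restricted solution whenever the event $B$ holds.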

It remains to show that $P(\hat I \subseteq I^*) \ge 1 - 2\delta$ for the $\ell_1 + \ell_2$ penalized estimate.
We reason as above and let

\[ m(\mu) = \frac{1}{n}\sum_{i=1}^n \Big\{Y_i - \sum_{j \in I^*} \mu_j X_{ij}\Big\}^2 + 2r \sum_{j \in I^*} |\mu_j| + c \sum_{j \in I^*} \mu_j^2 \]

and define

\[ \hat\mu = \arg\min_{\mu} m(\mu). \tag{4.20} \]

Then, by Lemma 4.3 in the Appendix B, $b = (\hat\mu, 0)$, where $0$ is a vector corresponding
to indices in $I^{*c}$, is a solution of (2.3) on the set



*= n 



jGl* 



< 2r 



Recall that $\hat\beta$ is a solution of (2.3) by construction, and that by Lemma 4.3
in the Appendix B, the solution is unique. Since, on the set $B$, $b_k = 0$ for
$k \in I^{*c}$, by construction, and $\hat\beta_k = 0$ on $\hat I^c$, by definition, we conclude that
$\hat I \subseteq I^*$ on the set $B$. Therefore the proof is identical to the one above, where we
now invoke Condition Identif with d < y^r| and the analogue of the proof of 



Theorem 2.3. 



□ 



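Proof of Theorem 3.7. By Proposition 3.4, it is enough to show that
$P(\hat I \subseteq I^*) \ge 1 - 2\delta$ for both the $\ell_1$ and $\ell_1 + \ell_2$ penalized estimates.

We begin by showing that $P(\hat I \subseteq I^*) \ge 1 - 2\delta$ for the $\ell_1$ penalized logistic
regression estimate. Let

\[ H(\mu) = \frac{1}{n}\sum_{i=1}^n \big\{ -Y_i \mu'X_i + \log(1 + \exp \mu'X_i) \big\} + 2r \sum_{j \in I^*} |\mu_j|, \]

where $\mu'X_i = \sum_{j \in I^*} \mu_j X_{ij}$, and define

\[ \bar\mu = \arg\min_{\mu} H(\mu). \tag{4.21} \]

Let

\[ B_1 = \bigcap_{k \notin I^*} \left\{ \left| \frac{1}{n}\sum_{i=1}^n X_{ik} \frac{\exp(\bar\mu'X_i)}{1 + \exp(\bar\mu'X_i)} - \frac{1}{n}\sum_{i=1}^n Y_i X_{ik} \right| \le 2r \right\}. \]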
Let, by abuse of notation, $\bar\mu \in \mathbb{R}^M$ be the vector that has the components
of $\bar\mu$ in positions corresponding to the index set $I^*$ and components equal to
zero otherwise. By standard results in convex analysis, e.g. Lemma 4.1 in the
Appendix below, it follows that, on the set $B_1$, $\bar\mu$ is a solution of (2.2). Recall
that $\hat\beta$ is a solution of (2.2) by construction. Then, by Proposition 4.2 in the
Appendix B, any two solutions have non-zero elements in the same positions.
Since, on the set $B_1$, $\hat\beta_k = 0$ for $k \in I^{*c}$ we conclude that $\hat I \subseteq I^*$ on the set $B_1$.
Hence, reasoning as in Theorem 3.5 above



p(/gr) < p(Bf) 




E 



i=l 



> r 



E IMj" 

j G I* 



1 ™ 

- E g'(a,i)XijXik 



> r 



i=l 

E Itj-Pjl 



> 



1 n 

- ^ g'(ai)XijXi 

n — J 



> 



< S +J2 P [ E 1^-^*1 >rk*/2d 



^ 6 + E p ( { E l& - ^1 ^ rr M n A, ) + E wj, 

fcgi* \ ^ei* J / fcg7* 

where, as in (4.15), we used Hoeffding's inequality to bound by $\delta$ the first term,
we used Condition Lidentif for the second term, and where



Dn=\ ^|^-/3;|<^ + (l + i) £ 
I jei* 



with $b = 1 - d(7 + \epsilon)$ and



log 2 



1 

2(MVn)+l X ~r~' 



Notice that by the definition of e and r, and since < b, s < 1, we always have 
(1 + f)e < Thus, for our choice of d, we have rfc*/2d > + (1 + ±)e. 
Therefore 



c a g n < E p P« n °n)+E p (^) = 5 + E p (^) ^ <^+E p ra- 



k4I' 



k4I* 



k4I' 



k=l 
Repeating now the proof of Theorem 2.4, with $\hat\beta$ replaced by $\bar\mu$, where we are
now using only the variables corresponding to $I^*$, we obtain that

|/Z-/3*|i<|rfc* + (l + -)e 
b r 



on the set 



a 2 = sup \(Vn-mm-m)\ <r 



pel 



\p-p*\i+e 



where, as in Theorem 2.4, we can show that $P(A_2^c) \le \delta/M$, for our choice of $r$.
Therefore

\[ P(\hat I \not\subseteq I^*) \le 2\delta. \]



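It remains to show that the result above also holds for the $\ell_1 + \ell_2$ penalized
estimator. Define

\[ M(\mu) = \frac{1}{n}\sum_{i=1}^n \Big\{ -Y_i \sum_{j \in I^*} \mu_j X_{ij} + \log\Big(1 + \exp \sum_{j \in I^*} \mu_j X_{ij}\Big) \Big\} + 2r \sum_{j \in I^*} |\mu_j| + c \sum_{j \in I^*} \mu_j^2 \tag{4.22} \]

and let

\[ \bar\mu = \arg\min_{\mu} M(\mu). \tag{4.23} \]

Let

\[ B_2 = \bigcap_{k \notin I^*} \left\{ \left| \frac{1}{n}\sum_{i=1}^n X_{ik} \frac{\exp\big(\sum_{j \in I^*} \bar\mu_j X_{ij}\big)}{1 + \exp\big(\sum_{j \in I^*} \bar\mu_j X_{ij}\big)} - \frac{1}{n}\sum_{i=1}^n Y_i X_{ik} \right| \le 2r \right\}. \]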



Let, by abuse of notation, $\bar\mu \in \mathbb{R}^M$ be the vector that has the components of
$\bar\mu$ in positions corresponding to the index set $I^*$ and components equal to zero
otherwise. By Lemma 4.3 in the Appendix B below it follows that,
on the set $B_2$, $\bar\mu$ is a solution of (2.3). Recall that $\hat\beta$ is a solution of (2.3) by
construction. Also by Lemma 4.3 in the Appendix B, the solution is unique.
Since, on the set $B_2$, $\hat\beta_k = \bar\mu_k = 0$ for $k \in I^{*c}$ we conclude that $\hat I \subseteq I^*$ on the set $B_2$.
Therefore the remainder of the proof is identical to the one above. □



Appendix B

4.1. Properties of $\ell_1$ penalized least squares and logistic regression
solutions

The solution of the $\ell_1$ penalized optimization problem may not be unique. However,
in this case, all solutions have zero elements in the same positions, as we
show below. We denote by $y = (Y_1, \ldots, Y_n)$ and by $X$ the $n \times M$ matrix with
entries $X_{ij}$. We let $L(\beta) = L(X, y; \beta)$ be a function depending on the data and
a parameter $\beta \in \mathbb{R}^M$. Let

\[ \hat\beta = \arg\min_{\beta} L(\beta) + \lambda \sum_{j=1}^M |\beta_j| =: \arg\min_{\beta} f(\beta) \tag{4.24} \]

for some fixed $\lambda > 0$. Let $S$ be the set of indices corresponding to the non-zero
components of a solution $\hat\beta$:

\[ S = \{k : \hat\beta_k \ne 0, \ 1 \le k \le M\}. \]

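Lemma 4.1. If $L$ is differentiable in $\beta$ and if for any minima $\hat\beta^{(1)}, \hat\beta^{(2)}$

\[ \frac{\partial L(\hat\beta^{(1)})}{\partial \beta_j} = \frac{\partial L(\hat\beta^{(2)})}{\partial \beta_j} \quad \text{for all } 1 \le j \le M, \tag{4.25} \]

then all $\hat\beta$ satisfying (4.24) have non-zero components in the same positions.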

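Proof. We recall that for any convex function $f : \mathbb{R}^M \to \mathbb{R}$ the subdifferential
of $f$ at a point $\beta$ is the set $D_\beta = \{w \in \mathbb{R}^M : f(u) - f(\beta) \ge \langle w, u - \beta \rangle \text{ for all } u\}$. For the
function $f$ defined in (4.24) this becomes

\[ D_\beta = \{w \in \mathbb{R}^M : w = \nabla L(\beta) + \lambda v\}, \]

where $\nabla L(\beta)$ is the $M$-dimensional vector having $\partial L(\beta)/\partial \beta_j$ as components and
$v \in \mathbb{R}^M$ is such that

\[ v_j = 1 \ \text{ if } \beta_j > 0, \qquad v_j = -1 \ \text{ if } \beta_j < 0, \qquad v_j \in [-1, 1] \ \text{ if } \beta_j = 0. \]

By standard results in convex analysis, $\hat\beta \in \mathbb{R}^M$ is a point of local minimum for
a convex function $f$ if and only if $0 \in D_{\hat\beta}$, where $0 \in \mathbb{R}^M$.
Therefore, $\hat\beta$ satisfies (4.24) if and only if

\[ \frac{\partial L(\hat\beta)}{\partial \beta_j} = -\lambda v_j, \quad \text{for all } 1 \le j \le M, \]

and so the index set $S$ of non-zero components of a solution is given by

\[ S = \left\{ 1 \le j \le M : \left| \frac{\partial L(\hat\beta)}{\partial \beta_j} \right| = \lambda \right\}. \]

Therefore, if (4.25) holds, $S$ is the same for all solutions. □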
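
The stationarity condition derived in this proof is easy to check numerically for the least squares loss $L(\beta) = \frac{1}{n}\sum_i (Y_i - \beta'X_i)^2$. The sketch below is an illustration under assumptions: the coordinate descent solver, the simulated data and the value `lam = 0.4` are hypothetical choices, not taken from the paper. It verifies that $\partial L(\hat\beta)/\partial\beta_j = -\lambda\,\mathrm{sign}(\hat\beta_j)$ on the non-zero coordinates and $|\partial L(\hat\beta)/\partial\beta_j| \le \lambda$ on the zero ones.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=500):
    """Coordinate descent for f(b) = (1/n) * ||y - X b||^2 + lam * ||b||_1."""
    n, M = X.shape
    beta = np.zeros(M)
    resid = y.astype(float).copy()          # resid = y - X @ beta
    col_sq = (X ** 2).mean(axis=0)
    for _ in range(n_sweeps):
        for j in range(M):
            resid += X[:, j] * beta[j]      # remove coordinate j from the fit
            rho = X[:, j] @ resid / n
            # Minimizer in b_j of (1/n) * sum_i (resid_i - X_ij b_j)^2 + lam * |b_j|.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta

# Hypothetical simulated data with standardized columns.
rng = np.random.default_rng(1)
n, M = 200, 10
X = rng.standard_normal((n, M))
X /= np.sqrt((X ** 2).mean(axis=0))
beta_star = np.zeros(M)
beta_star[:3] = [2.0, -1.5, 1.0]
y = X @ beta_star + 0.5 * rng.standard_normal(n)

lam = 0.4
beta_hat = lasso_cd(X, y, lam)

grad = -2.0 * X.T @ (y - X @ beta_hat) / n  # gradient of L at the solution
active = beta_hat != 0
print("active set:", np.flatnonzero(active))
print("gradient on active set:", grad[active])
```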
Proposition 4.2. Let $L$ correspond to either the least squares or the logistic
criteria. Let $\hat\beta^{(1)}$ and $\hat\beta^{(2)}$ be two minima of (4.24). Then:

(1) $X(\hat\beta^{(1)} - \hat\beta^{(2)}) = 0$, for either estimate.

(2) All solutions of (4.24), for either estimate, have non-zero components in the
same positions.

Proof. The proof uses simple properties of convex functions. First, we recall
that the set of minima of a convex function is convex. Therefore, if $\hat\beta^{(1)}$ and $\hat\beta^{(2)}$
are two distinct points of minima, so is $\rho\hat\beta^{(1)} + (1 - \rho)\hat\beta^{(2)}$, for any $0 < \rho < 1$.
Re-write this convex combination as $\hat\beta^{(2)} + \rho\eta$, where $\eta = \hat\beta^{(1)} - \hat\beta^{(2)}$. Then, recall
that the minimum value of any convex function is unique. For clarity, we argue
separately for the two estimates.

$\ell_1$ penalized least squares. By the above arguments we have that

\[ F(\rho) =: \frac{1}{n}\sum_{i=1}^n \big\{ Y_i - (\hat\beta^{(2)} + \rho\eta)'X_i \big\}^2 + \lambda \sum_{j=1}^M |\hat\beta_j^{(2)} + \rho\eta_j| = c, \tag{4.26} \]

where $c$ is some positive constant, for any $0 < \rho < 1$. By taking the derivative
with respect to $\rho$ of $F(\rho)$ above we obtain

\[ -\frac{2}{n}\sum_{i=1}^n Y_i \Big(\sum_{j=1}^M \eta_j X_{ij}\Big) + \frac{2}{n}\sum_{i=1}^n \Big(\sum_{j=1}^M \hat\beta_j^{(2)} X_{ij}\Big)\Big(\sum_{j=1}^M \eta_j X_{ij}\Big) + \frac{2\rho}{n}\sum_{i=1}^n \Big(\sum_{j=1}^M \eta_j X_{ij}\Big)^2 + \lambda \sum_{j=1}^M \eta_j \,\mathrm{sign}(\hat\beta_j^{(2)} + \rho\eta_j) = 0. \]

Since the function $a + b\rho$ is continuous in $\rho$ then, on a small neighborhood $U$
of $\rho$ the sign of $\hat\beta_j^{(2)} + \rho\eta_j$, for each $j$, will be constant. Therefore, on $U$, the
first two and the last term of the display above are constant with respect to
$\rho$. Denoting the sum of these terms by $C$ we have

\[ \frac{2\rho}{n}\sum_{i=1}^n \Big(\sum_{j=1}^M \eta_j X_{ij}\Big)^2 + C = 0, \quad \text{for any } \rho \in U. \]

By taking again the derivative with respect to $\rho$ we obtain that $X\eta = 0$, which
is the result stated in the first part of this proposition.

$\ell_1$ penalized logistic regression. We argue as above that the value of the function
$G$ below evaluated at a point of minimum is constant, and we evaluate it at
a convex combination of two minima, as before. Thus, defining $G(\rho)$ as the
quantity below

\[ G(\rho) = \frac{1}{n}\sum_{i=1}^n \big\{ -Y_i (\hat\beta^{(2)} + \rho\eta)'X_i + \log\big(1 + \exp((\hat\beta^{(2)} + \rho\eta)'X_i)\big) \big\} + \lambda \sum_{j=1}^M |\hat\beta_j^{(2)} + \rho\eta_j|, \]

we have that $G(\rho) = c$, for some positive constant $c > 0$. Reasoning as above,
we can take the derivative of the above function twice, with respect to $\rho$. Then,
on a small neighborhood $\rho \in V$ we have

\[ \frac{1}{n}\sum_{i=1}^n \Big(\sum_{j=1}^M \eta_j X_{ij}\Big)^2 \frac{\exp\big(\sum_{j=1}^M (\hat\beta_j^{(2)} + \rho\eta_j) X_{ij}\big)}{\big(1 + \exp\big(\sum_{j=1}^M (\hat\beta_j^{(2)} + \rho\eta_j) X_{ij}\big)\big)^2} = 0, \quad \text{for any } \rho \in V, \]

which implies that $\sum_{j=1}^M \eta_j X_{ij} = 0$ for all $i$, which in turn implies that $X\eta = 0$,
as claimed in part (1) of this proposition.

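The second part of the proposition follows trivially from the first part and
by Lemma 4.1. It is enough to show that $X(\hat\beta^{(1)} - \hat\beta^{(2)}) = 0$ implies

\[ \frac{\partial L(\hat\beta^{(1)})}{\partial \beta_j} = \frac{\partial L(\hat\beta^{(2)})}{\partial \beta_j} \quad \text{for all } j. \]

For the $\ell_1$ penalized least squares estimator we have

\[ \frac{\partial L(\beta)}{\partial \beta_j} = \frac{\partial}{\partial \beta_j} \frac{1}{n}\sum_{i=1}^n \Big(Y_i - \sum_{k=1}^M \beta_k X_{ik}\Big)^2 = -\frac{2}{n}\sum_{i=1}^n Y_i X_{ij} + \frac{2}{n}\sum_{i=1}^n \sum_{k=1}^M \beta_k X_{ik} X_{ij}, \]

and the last term is constant across solutions if $X(\hat\beta^{(1)} - \hat\beta^{(2)}) = 0$, for any two
solutions, and this is implied by part (1).

For the $\ell_1$ penalized logistic regression estimate we have

\[ \frac{\partial L(\beta)}{\partial \beta_k} = -\frac{1}{n}\sum_{i=1}^n Y_i X_{ik} + \frac{1}{n}\sum_{i=1}^n X_{ik} \frac{\exp\big(\sum_{j=1}^M \beta_j X_{ij}\big)}{1 + \exp\big(\sum_{j=1}^M \beta_j X_{ij}\big)}. \]

This will be constant across solutions if $\sum_{j=1}^M \beta_j X_{ij}$, for all $i$, is the same for all
solutions, which is again implied by the result in part (1). This concludes the
proof of this proposition. □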



4.2. Properties of the $\ell_1 + \ell_2$ penalized least squares and logistic
regression solutions

We discuss below a number of properties of the solution of the $\ell_1 + \ell_2$ penalized
optimization problem. We begin by giving this result in terms of general likelihood
functions and we obtain the results for our two examples as consequences.
As in the previous sub-section, we let $L(\beta) = L(X, y; \beta)$ be any function depending
on the data and a parameter $\beta \in \mathbb{R}^M$. Let

\[ \hat\beta = \arg\min_{\beta} L(\beta) + \lambda \sum_{j=1}^M |\beta_j| + c \sum_{j=1}^M \beta_j^2 =: \arg\min_{\beta} s(\beta), \tag{4.27} \]

for some given tuning parameters $\lambda, c > 0$. We note that this solution is different
from the one introduced in the previous subsection, but for brevity we use the
same notation.
Lemma 4.3. If $L$ is differentiable in $\beta$ then a solution of (4.27) satisfies

\[ \left| \frac{\partial L(\hat\beta)}{\partial \beta_j} + 2c\hat\beta_j \right| = \lambda, \ \text{ if } \hat\beta_j \ne 0, \qquad \left| \frac{\partial L(\hat\beta)}{\partial \beta_j} \right| \le \lambda, \ \text{ if } \hat\beta_j = 0. \tag{4.28} \]

Moreover, the solution of (4.27) is unique for both the square and the logistic
losses, respectively.

Proof. Appealing to the elementary properties of convex functions introduced
in Lemma 4.1 and applying them now to the function $s$ above we trivially obtain
the first part of this Lemma.

For the moreover part, let $\hat\beta^{(1)}$ and $\hat\beta^{(2)}$ be two solutions of (4.27). We show
below that $\hat\beta^{(1)} = \hat\beta^{(2)}$ for the two losses under study. Since $s$ is a convex function
of $\beta$, for either loss, any convex combination of solutions is a solution, and $s(\beta)$
is constant across solutions. Consider as before the convex combination $\hat\beta^{(2)} + \rho\eta$,
where $\eta = \hat\beta^{(1)} - \hat\beta^{(2)}$. Recall that the minimum value of any convex function is
unique. Then, for the $\ell_1 + \ell_2$ penalized least squares estimator we obtain:

\[ \frac{1}{n}\sum_{i=1}^n \big\{ Y_i - (\hat\beta^{(2)} + \rho\eta)'X_i \big\}^2 + \lambda \sum_{j=1}^M |\hat\beta_j^{(2)} + \rho\eta_j| + c \sum_{j=1}^M (\hat\beta_j^{(2)} + \rho\eta_j)^2 = c_0, \]

where $c_0$ is some positive constant, for any $0 < \rho < 1$. Reasoning now exactly as
in Proposition 4.2 above and taking the derivative with respect to $\rho$ twice, we
obtain

\[ \frac{2}{n}\sum_{i=1}^n \Big(\sum_{j=1}^M \eta_j X_{ij}\Big)^2 + 2c \sum_{j=1}^M \eta_j^2 = 0, \]

which immediately implies $\eta_j = 0$ for all $j$, that is $\hat\beta^{(1)} = \hat\beta^{(2)}$.

The same conclusion can be obtained for the logistic regression estimator,
where we now differentiate twice with respect to $\rho$ the function
$G_1(\rho) = G(\rho) + c \sum_{j=1}^M (\hat\beta_j^{(2)} + \rho\eta_j)^2$, with $G(\rho)$ defined in the proof of Proposition 4.2. This
yields

\[ \frac{1}{n}\sum_{i=1}^n \Big(\sum_{j=1}^M \eta_j X_{ij}\Big)^2 \frac{\exp\big(\sum_{j=1}^M (\hat\beta_j^{(2)} + \rho\eta_j) X_{ij}\big)}{\big(1 + \exp\big(\sum_{j=1}^M (\hat\beta_j^{(2)} + \rho\eta_j) X_{ij}\big)\big)^2} + 2c \sum_{j=1}^M \eta_j^2 = 0. \]

Reasoning as above we again obtain $\hat\beta^{(1)} = \hat\beta^{(2)}$. This completes the proof of
this Lemma. □
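
Both parts of Lemma 4.3 can be illustrated for the least squares loss. The sketch below is a hedged example, not the paper's algorithm: the coordinate descent solver, the simulated design with a deliberately duplicated column, and the values `lam = 0.4`, `c = 0.5` are assumptions. It runs the solver from two different starting points, checks that both runs reach the same minimizer even though $X$ is rank deficient, and checks the conditions (4.28).

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def enet_cd(X, y, lam, c, beta0, n_sweeps=500):
    """Coordinate descent for (1/n)||y - X b||^2 + lam * ||b||_1 + c * ||b||_2^2."""
    n, M = X.shape
    beta = np.array(beta0, dtype=float)
    resid = y - X @ beta
    col_sq = (X ** 2).mean(axis=0)
    for _ in range(n_sweeps):
        for j in range(M):
            resid += X[:, j] * beta[j]      # remove coordinate j from the fit
            rho = X[:, j] @ resid / n
            # The ridge term adds c to the curvature of the one-dimensional problem.
            beta[j] = soft_threshold(rho, lam / 2.0) / (col_sq[j] + c)
            resid -= X[:, j] * beta[j]
    return beta

# Hypothetical simulated data; column 1 duplicates column 0, so X is rank deficient.
rng = np.random.default_rng(4)
n, M = 100, 6
X = rng.standard_normal((n, M))
X[:, 1] = X[:, 0]
X /= np.sqrt((X ** 2).mean(axis=0))
beta_star = np.array([1.5, 1.5, 0.0, -2.0, 0.0, 0.0])
y = X @ beta_star + 0.5 * rng.standard_normal(n)

lam, c = 0.4, 0.5
beta_a = enet_cd(X, y, lam, c, np.zeros(M))
beta_b = enet_cd(X, y, lam, c, 5.0 * rng.standard_normal(M))

grad = -2.0 * X.T @ (y - X @ beta_a) / n    # gradient of the least squares loss
active = beta_a != 0
print("solution from zero start:  ", beta_a)
print("solution from random start:", beta_b)
```

The duplicated columns receive equal coefficients, which is consistent with the remark in the abstract that the $\ell_1 + \ell_2$ estimates are unique and more stable for highly correlated designs.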